TAL-effector assembly platform, customized services, kits and assays

ABSTRACT

The invention generally relates to compositions and methods for designing and producing functional DNA binding effector molecules and associated customized services, tool kits and functional assays. In some aspects, the invention provides methods and tools for efficient assembly of customized TAL effector molecules. Furthermore, the invention relates to uses of TAL effector molecules and functional evaluation of such TAL by, for example, customized assays.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a divisional of U.S. patent application Ser. No. 15/951,938 filed Apr. 12, 2018, which is a divisional of U.S. patent application Ser. No. 14/811,363, filed Jul. 28, 2015, now abandoned, which is a continuation of U.S. patent application Ser. No. 13/856,978, filed Apr. 4, 2013, now abandoned, which claims the benefit of priority to U.S. Provisional Patent Application No. 61/784,658 filed Mar. 14, 2013; U.S. Provisional Patent Application No. 61/644,975 filed May 9, 2012 and U.S. Provisional Patent Application No. 61/620,228 filed Apr. 4, 2012, which disclosures are herein incorporated by reference in their entirety.

SEQUENCE LISTING

The instant application contains a Sequence Listing which has been submitted in ASCII format via EFS-Web and is hereby incorporated by reference in its entirety. Said ASCII copy, created on Nov. 11, 2019, is named LT00652DIV_SL.txt and is 189,133 bytes in size.

FIELD OF THE INVENTION

The invention generally relates to compositions and methods for designing and producing functional DNA binding effector molecules and associated customized services, tool kits and functional assays. In some aspects, the invention provides methods and tools for efficient assembly of customized TAL effector molecules. Furthermore, the invention relates to uses of TAL effector molecules and functional evaluation of such TAL by, for example, customized assays.

BACKGROUND

Transcription activator-like (TAL) effectors represent a class of DNA binding proteins secreted by plant-pathogenic bacteria of genera such as Xanthomonas and Ralstonia via their type III secretion system upon infection of plant cells. Natural TAL effectors specifically have been shown to bind to plant promoter sequences thereby modulating gene expression and activating effector-specific host genes to facilitate bacterial propagation (Römer, P., et al., Plant pathogen recognition mediated by promoter activation of the pepper Bs3 resistance gene. Science 318, 645-648 (2007); Boch, J. & Bonas, U. Xanthomonas AvrBs3 family-type III effectors: discovery and function. Annu. Rev. Phytopathol. 48, 419-436 (2010); Kay, S., et al. A bacterial effector acts as a plant transcription factor and induces a cell size regulator. Science 318, 648-651 (2007); Kay, S. & Bonas, U. How Xanthomonas type III effectors manipulate the host plant. Curr. Opin. Microbiol. 12, 37-43 (2009)). Natural TAL effectors are generally characterized by a central repeat domain and a carboxyl-terminal nuclear localization signal sequence (NLS) and a transcriptional activation domain (AD). The central repeat domain typically consists of a variable number of amino acid repeats, between 1.5 and 33.5, that are usually 33-35 residues in length except for a generally shorter carboxyl-terminal repeat referred to as half-repeat. The repeats are mostly identical but differ in certain hypervariable residues. DNA recognition specificity of TAL effectors is mediated by hypervariable residues typically at positions 12 and 13 of each repeat, the so-called repeat variable diresidue (RVD), wherein each RVD targets a specific nucleotide in a given DNA sequence. Thus, the sequential order of repeats in a TAL protein tends to correlate with a defined linear order of nucleotides in a given DNA sequence. The underlying RVD code of some naturally occurring TAL effectors has been identified, allowing prediction of the sequential repeat order required to bind to a given DNA sequence (Boch, J. et al. Breaking the code of DNA binding specificity of TAL-type III effectors. Science 326, 1509-1512 (2009); Moscou, M. J. & Bogdanove, A. J. A simple cipher governs DNA recognition by TAL effectors. Science 326, 1501 (2009)). Further, TAL effectors generated with new repeat combinations have been shown to bind to target sequences predicted by this code. It has been shown that the target DNA sequence generally starts with a 5′ thymine base to be recognized by the TAL protein.
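
The prediction of a repeat order from a target sequence can be illustrated with a minimal Python sketch. The RVD-to-base associations used here (NI for adenine, HD for cytosine, NN for guanine, NG for thymine) are the ones commonly reported in the literature cited above and are not taken from this document; NN is also reported to recognize adenine, so its use for guanine is an assumption. The function name rvds_for_target is hypothetical, and the sketch assumes that the leading 5′ thymine is not assigned a repeat of its own.

```python
# Commonly reported RVD-to-base associations (assumption: not stated in this
# document; NN is used for guanine although it is also reported to bind adenine).
BASE_TO_RVD = {"A": "NI", "C": "HD", "G": "NN", "T": "NG"}


def rvds_for_target(target: str) -> list[str]:
    """Return one RVD per nucleotide that follows the leading 5' thymine."""
    seq = target.upper()
    if not seq or seq[0] != "T":
        raise ValueError("target is expected to start with a 5' thymine")
    unknown = set(seq) - set(BASE_TO_RVD)
    if unknown:
        raise ValueError(f"unsupported bases: {sorted(unknown)}")
    # Skip the leading thymine (assumed not to be bound by a repeat)
    # and map each remaining base to a single RVD, in order.
    return [BASE_TO_RVD[base] for base in seq[1:]]


if __name__ == "__main__":
    print(rvds_for_target("TACACG"))
```

The linear, one-repeat-per-base mapping mirrors the correlation between repeat order and nucleotide order described above; real designs may weigh additional criteria that this sketch ignores.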

The modular structure of TALs allows for combination of the DNA binding domain with effector molecules such as nucleases. In particular, TAL effector nucleases allow for the development of new genome engineering tools.

Zinc-finger nucleases (ZFN) and meganucleases are examples of other genome engineering tools. ZFNs are chimeric proteins consisting of a zinc-finger DNA-binding domain and a nuclease domain. One example of a nuclease domain is the non-specific cleavage domain from the type IIS restriction endonuclease FokI (Kim, Y G; Cha, J., Chandrasegaran, S. Hybrid restriction enzymes: zinc finger fusions to Fok I cleavage domain Proc. Natl. Acad. Sci. USA. 1996 Feb. 6; 93(3):1156-60) typically separated by a linker sequence of 5-7 bp. A pair of FokI cleavage domains is generally required to allow for dimerization of the domain and cleavage of a non-palindromic target sequence from opposite strands. The DNA-binding domains of individual Cys2His2 ZFNs typically contain between 3 and 6 individual zinc-finger repeats and can each recognize between 9 and 18 base pairs.

One problem associated with ZFNs is the possibility of off-target cleavage, which may lead to random integration of donor DNA or result in chromosomal rearrangements or even cell death, and which still raises concern about applicability in higher organisms (Zinc-finger Nuclease-induced Gene Repair With Oligodeoxynucleotides: Wanted and Unwanted Target Locus Modifications Molecular Therapy vol. 18 no. 4, 743-753 (2010)).

Another group of genomic engineering proteins are sequence-specific rare cutting endonucleases with recognition sites exceeding 12 bp, so-called meganucleases or homing endonucleases. The large DNA recognition sites of 12 to 40 base pairs usually occur only once in a given genome and meganucleases (such as, e.g., I-SceI) are therefore considered the most specific restriction enzymes in nature and have been used to modify all sorts of genomes from plants or animals. One example of a meganuclease is PI-SceI, which belongs to the LAGLIDADG (SEQ ID NO: 233) family of homing endonucleases. However, the repertoire of naturally occurring meganucleases is limited and decreases the probability of finding a specific enzyme for a defined genomic target sequence. Meganucleases are therefore engineered to modify their recognition sequence. To develop tailored meganucleases with new recognition sites, two main approaches have been adopted: random mutagenesis of residues in the binding domain and subsequent selection of functional variants or fusing other enzyme domains to meganuclease half-sites to create chimeric meganucleases.

There is a need to improve these tools to (1) make them more flexible and reliable, (2) develop new means to predict and rationally design new binders, (3) tailor and modify effector activities and (4) efficiently assemble, test and deliver the engineered molecules.

SUMMARY OF THE INVENTION

The invention relates to compositions and methods which may be used for genetic engineering and altering the structure and/or function of nucleic acid molecules (e.g., nucleic acid molecules located within cells). In some aspects, the invention relates, in part, to compositions and methods for in vivo genetic manipulation (e.g., involving homologous recombination) and the alteration of gene expression (e.g., gene activation, repression, modulation, etc.).

Furthermore, the invention includes methods, compositions and tools to design and efficiently assemble nucleic acid molecules. In particular, the described methods and vectors are useful for assembling nucleic acid molecules encoding TAL effectors and TAL effector fusion proteins but can also be used to assemble other nucleic acid sequences encoding complex or modular protein functions or fusions.

In some embodiments, the invention includes linear nucleic acid molecules (e.g., linear vectors), such as those comprising one or more (e.g., two or more, three or more or all four) of the following: (a) a region encoding an N-terminal portion of a TAL effector, (b) a region encoding a C-terminal portion of a TAL effector, (c) at least one recombination site, and (d) at least one covalently bound topoisomerase, as well as methods for producing and using such nucleic acid molecules. In many instances, the topoisomerase will be located at one or both of the termini of the linear nucleic acid molecule. Also, in many instances, the covalently bound topoisomerase will be located within 100 (e.g., from about 2 to about 90, from about 5 to about 90, from about 10 to about 90, from about 15 to about 90, from about 20 to about 90, from about 25 to about 90, from about 30 to about 90, from about 2 to about 40, from about 5 to about 40, from about 10 to about 40, from about 15 to about 40, from about 2 to about 25, from about 5 to about 25, etc.) nucleotides of a recombination site.

In some more specific embodiments, the invention includes linear nucleic acid molecules, such as those comprising: (a) a region encoding an N-terminal portion of a TAL effector, (b) a region encoding a C-terminal portion of a TAL effector, and (c) at least one covalently bound topoisomerase, as well as methods for producing and using such nucleic acid molecules.

In many instances, linear nucleic acid molecules of the invention will have a sequence which is complementary to a sequence generated by a Type IIS restriction endonuclease. Thus, the invention also includes methods for generating one or more nucleic acid segments which contain overhangs at one or both termini generated by digestion with a Type IIS restriction endonuclease, followed by contacting one or more nucleic acid segments with one or more linear nucleic acid molecules under conditions which allow for covalent joining of the digested nucleic acid segment with the one or more linear nucleic acid molecules.

Linear nucleic acid molecules, such as those described above, may be circularized. In many instances, circularization will result in the addition of nucleic acid to the linear nucleic acid molecules (e.g., the addition of an “insert”). Further, in some instances when the nucleic acid molecules are circularized and contain TAL repeats (i.e., more than one TAL nucleic acid binding cassette) located between the termini of the linear nucleic acid molecules, the circularized nucleic acid molecules will encode TAL effectors capable of binding to specified nucleic acid sequences. In other instances, the circularized nucleic acid molecules (e.g., vectors) may contain coding sequences for one or more components of TAL effectors.

When a vector, or other nucleic acid molecule, is circularized, it may be circularized by covalent linkage of one or both strands of one or both ends by the action of a ligase or topoisomerase. Further, a circularized vector may contain one or more nicks in one or both strands. As an example, nicks may be located in one strand at one or both junctions where an insert is added to the vector. The presence of one or more nicks will generally result in a relaxed supercoil structure of the circularized vector. Further, nicks may be repaired in vitro via the use of, for example, ligases. Nicks may also be repaired in vivo (within a cell) via cellular repair mechanisms.

In many instances, linear nucleic acid molecules of the invention will be molecules such as vectors. Thus, linear nucleic acid molecules of the invention may contain one or more origins of replication. Such origins of replication may allow for replication in particular cell types, such as prokaryotic cells (e.g., Escherichia coli, Synechococcus species, etc.) and eukaryotic cells (e.g., Chlamydomonas reinhardtii, human cells, mouse cells, sf9 cells, etc.).

Further, linear nucleic acid molecules of the invention may comprise one or more recombination sites. In some instances, such recombination sites are selected from the group consisting of (a) att sites (e.g., attB, attP, attL, and attR sites), (b) lox sites (e.g., loxP, loxP511, etc.), and (c) frt sites.

Topoisomerases suitable for use with the invention vary greatly but will typically have the ability to covalently join at least one strand of two nucleic acid termini. Thus, linear nucleic acid molecules of the invention may comprise at least one covalently bound topoisomerase which is a Type IA, Type IB, Type IIA, and/or Type II topoisomerase. In some instances, the covalently bound topoisomerase is a Vaccinia virus topoisomerase. The invention also includes methods for generating linear nucleic acid molecules with one or more covalently bound topoisomerases.

Linear nucleic acid molecules of the invention may have two blunt termini or an overhang (e.g., a 5′ and/or a 3′ overhang) on at least one terminus. Further, the lengths of overhangs, when present, may vary greatly but will often be between one and ten (e.g., from about 1 to about 6, from about 2 to about 6, from about 3 to about 6, from about 1 to about 4, etc.) nucleotides in length.

In some specific instances, the overhang on linear nucleic acid molecules of the invention will be a single thymine or uridine. Typically, such overhangs will be 5′ overhangs present on one or both termini. Termini such as this will often be useful in what is referred to as TA cloning. TA cloning makes use of the generation of a polymerase chain reaction (PCR) product produced using a non-proofreading polymerase having a tendency to leave 3′ terminal adenines at the termini of the resulting PCR products. Thus, the invention includes the use of linear nucleic acid molecules of the invention in TA cloning procedures.

The invention also includes methods for preparing TAL effector libraries, as well as the libraries themselves and methods for using such libraries. In some aspects, such methods comprise (a) connecting a population of TAL nucleic acid binding cassettes encoding TAL subunits that individually bind adenine, guanine, thymidine, or cytosine bases, when the base is present in a nucleic acid molecule (e.g., to generate a TAL repeat) and (b) introducing the connected TAL nucleic acid binding cassettes (e.g., a TAL repeat) generated in (a) into a vector to generate a TAL effector library. Such libraries will often encode TAL effectors which bind to different nucleotide sequences.

The AT/CG ratio of nucleic acids differs between organisms, within the genome of the same organism, and in different locations within a genome of the same organism. For example, a eukaryotic organism will often have a different AT/CG ratio in nucleic acid which forms the nuclear genome and the mitochondrial genome. Thus, when generating a TAL effector library designed to bind to nucleic acid of (1) a particular genome or (2) a region or regions (e.g., promoter regions) of a particular genome, the nucleic acid binding site may be “biased” towards the generation of binding domains for the desired target. Thus, the invention includes methods for generating TAL effector libraries wherein TAL nucleic acid binding cassettes that encode adenine, guanine, thymidine, and cytosine binders are either all present in equimolar amounts or not all present in equimolar amounts. In specific instances, TAL nucleic acid binding cassettes that encode adenine and thymine binders are present in equimolar amounts and represent from about 51% to about 75% (e.g., from about 51% to about 70%, from about 51% to about 65%, from about 51% to about 60%, from about 51% to about 55%, from about 55% to about 65%, etc.) of the total TAL nucleic acid binding cassettes present. In other specific instances, TAL nucleic acid binding cassettes that encode cytosine and guanine binders are present in equimolar amounts and represent from about 51% to about 75% (e.g., from about 51% to about 70%, from about 51% to about 65%, from about 51% to about 60%, from about 51% to about 55%, from about 55% to about 65%, etc.) of the total TAL nucleic acid binding cassettes present. Thus, the invention includes methods for generating TAL effector libraries comprising TAL nucleic acid binding cassettes with nucleic acid recognition having an AT/CG ratio biased in favor of the genome or region of a genome for which binding activity is sought.
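
As a rough illustration of how such a biased cassette pool might be specified, the following Python sketch derives per-base cassette fractions from a desired combined share of adenine and thymine binders. The function names at_fraction and cassette_fractions are hypothetical, and the sketch assumes that the two binders within each pair (A/T and C/G) are kept equimolar with each other, which is one of the options described above rather than a requirement.

```python
def at_fraction(sequence: str) -> float:
    """Fraction of A and T bases in a target sequence or region."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(base in "AT" for base in seq) / len(seq)


def cassette_fractions(at_share: float) -> dict[str, float]:
    """Split a cassette pool so that A and T binders together make up
    at_share of the pool and C and G binders make up the remainder.

    Assumption: the two binders within each pair are equimolar.
    """
    if not 0.0 <= at_share <= 1.0:
        raise ValueError("at_share must be between 0 and 1")
    at_each = at_share / 2
    cg_each = (1.0 - at_share) / 2
    return {"A": at_each, "T": at_each, "C": cg_each, "G": cg_each}


if __name__ == "__main__":
    # Example: a pool in which A and T binders together represent 60%.
    print(cassette_fractions(0.60))
```

A share measured on a target region with at_fraction could be passed to cassette_fractions directly, or first clamped to the roughly 51% to 75% window mentioned above; that choice is left to the user of the sketch.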

In some instances, these TAL effector libraries will not be bound to a fusion partner but nucleic acid binding activity can be assessed and a fusion partner, if desired, can be added later to form a TAL effector fusion. In many instances, TAL effector libraries of the invention will encode TAL effector fusions.

In some instances, TAL effector fusions of the invention will have transcriptional activation activity. In other instances, TAL effector fusions of the invention will inhibit transcription. Transcriptional inhibition may be conferred by a number of different mechanisms, including blocking of a binding site for transcriptional activators.

Vectors suitable for use in compositions of the invention include viral vectors (e.g., lentiviral vectors, adenoviral vectors, etc.).

The invention also includes methods for identifying TAL effectors that bind to specified nucleotide sequences, as well as TAL effectors identified by such methods. In some instances, such methods comprise (a) connecting a population of TAL nucleic acid binding cassettes which individually encode TAL repeats that bind to one of the bases adenine, guanine, thymidine, and cytosine, when the base is present in a nucleic acid molecule, (b) introducing the connected TAL nucleic acid binding cassettes generated in (a) into a vector to generate a TAL effector library, wherein the library contains TAL effectors which bind to different nucleotide sequences, (c) introducing the TAL effector library into a cell under conditions which allow for the expression of TAL effectors, and (d) screening the cells generated in (c) to identify cells in which at least one cellular parameter is altered by expression of a TAL effector. In some instances, the cellular parameter is TAL effector induced transcriptional activation of a non-TAL effector gene. Further, cells used in the practice of this aspect of the invention may contain nucleic acid comprising a promoter operably linked to a reporter (e.g., lacZ, green fluorescent protein, etc.). Such cells may be used in methods wherein the cellular parameter is transcriptional activation of the reporter.

The invention further includes novel transcription activator-like (TAL) repeats and TAL repeat amino acid sequences, as well as other components of TAL proteins. As described herein, TAL homologs were identified by amino acid sequence based bioinformatic searches using known TAL amino acid sequences. Once a prospective TAL protein was identified, the amino acid sequence of the protein was then analyzed for TAL repeats and other features.

In many instances, proteins which contain TAL repeats described herein will bind nucleic acid (e.g., DNA) in a sequence specific manner. Assays for measuring sequence specific nucleic acid binding activity and characteristics of such proteins are described elsewhere herein.

In some aspects, provided herein are embodiments of further TAL repeat structures, including TAL effector (TALE) molecules containing repeat sequences, as well as further amine- and carboxyl-terminal sequences flanking a repeated region. These further TAL repeats, as well as the flanking regions, independently can be incorporated into TAL fusion proteins and nucleic acids encoding such fusion proteins. The invention further includes vectors comprising the nucleic acids encoding TAL repeats and TAL fusion proteins, host cells comprising the vectors, and kits for practicing various embodiments of the invention.

The invention includes, in part, non-naturally occurring proteins (e.g., fusion proteins such as non-naturally occurring fusion proteins) which contain one or more TAL repeats (e.g., TAL repeats with sequence specific nucleic acid binding activity). In some embodiments, such non-naturally occurring proteins comprise (a) an amine terminal region (e.g., an amine terminal region of from about 25 to about 500 amino acids, from about 50 to about 500 amino acids, from about 75 to about 500 amino acids, from about 100 to about 500 amino acids, from about 150 to about 500 amino acids, from about 50 to about 250 amino acids, etc.), (b) a carboxyl terminal region of between 25 and 500 amino acids (e.g., a carboxyl terminal region of from about 25 to about 500 amino acids, from about 50 to about 500 amino acids, from about 75 to about 500 amino acids, from about 100 to about 500 amino acids, from about 150 to about 500 amino acids, from about 50 to about 250 amino acids, etc.), and (c) a central region containing five or more (e.g., from about 5 to about 25, from about 5 to about 20, from about 5 to about 18, from about 10 to about 30, from about 10 to about 25, from about 15 to about 25, etc.) amino acid segments which confer upon the non-naturally occurring protein sequence specific nucleic acid binding activity. In some embodiments, all of or one or more of the individual amino acid segments which form the central region are from about 30 to about 38 amino acids, from about 30 to about 37 amino acids, from about 30 to about 36 amino acids, from about 30 to about 35 amino acids, from about 33 to about 35 amino acids, etc., in length.

The amino acid segments which form the central region may contain one or more amino acid sequences at least 80%, 85%, 90%, 95%, or 100% identical to one or more of the following amino acid sequences: (1) FSQADIVKIAGN (SEQ ID NO:37), (2) GGAQALQAVLDLEP (SEQ ID NO:38), (3) GGAQALQAVLDLEPALRERG (SEQ ID NO:39), (4) FRTEDIVQMVS (SEQ ID NO:40), (5) GGSKNLAAVQA (SEQ ID NO:41), (6) GGSKNLEAVQA (SEQ ID NO:42), (7) LEPKDIVSIAS (SEQ ID NO:43), (8) GATQAITTLLNKW (SEQ ID NO:44), (9) GATQAITTLLNKWDXLRAKG (SEQ ID NO:45), and (10) GATQAITTLLNKWGXLRAKG (SEQ ID NO:46). In some instances, X in the above sequences may independently be one of the following amino acids: aspartic acid, serine, alanine, or glutamic acid. The invention also includes peptides and proteins which comprise the above amino acid sequences, as well as nucleic acid molecules which encode such amino acid sequences.

A number of TAL proteins are known in the art. Thus, in some specific aspects, the invention does not include proteins which are in the prior art.

In many instances, proteins of the invention and, in appropriate instances, subcomponents thereof are not identical to an amino acid sequence of a TAL protein which naturally occurs in a bacterium of the genera Burkholderia, Xanthomonas or Ralstonia. In specific embodiments, the invention does not include non-naturally occurring proteins in which at least one (e.g., one, two, three, four, five, six, etc. or all) of the amino acid segments is identical to an amino acid sequence of a TAL protein which naturally occurs in a bacterium of the genera Burkholderia, Xanthomonas or Ralstonia. In some instances, the invention does not include one or more amino acid segments identical to an amino acid sequence of a TAL protein, with the exception of the RVD sequence, which naturally occurs in a bacterium of one or more of the genera Burkholderia, Xanthomonas or Ralstonia.

In additional specific embodiments, the invention does not include non-naturally occurring proteins comprising at least one (e.g., one, two, three, four, five, six, etc. or all) amino acid segment identical to either an amino acid sequence shown in FIG. 30 or one of the first eighteen amino acid sequences shown in FIG. 30. However, in some embodiments, the invention does include proteins which contain any sequence shown in FIG. 30 (as well as any other amino acid sequence found in nature or otherwise known) in combination with other sequences (e.g., TAL repeat sequences provided herein).

The invention also includes non-naturally occurring proteins (e.g., fusion proteins) comprising a region containing five or more amino acid segments (in some instances, collectively referred to as “TAL repeats”) which confer upon the non-naturally occurring protein sequence specific nucleic acid binding activity. In some instances, each of the five or more amino acid segments has a length of 32-35 amino acids (e.g., some amino acid segments having a length of 33 amino acids and others having a length of 35 amino acids). In additional instances, at least one of the five or more amino acid segments has an isoleucine residue at position 6. In further additional instances, amino acid 12 or amino acids 12 and 13 of at least one of the five or more amino acid segments confers upon the amino acid segment the ability to recognize a single base in a nucleic acid molecule. In some instances, at least one of the five or more amino acid segments comprises at amino acid positions 14-19 an amino acid sequence at least 80%, 85%, 90%, or 95% identical to an amino acid sequence selected from the group consisting of: (a) GG(A or T)Q(A or T)L (SEQ ID NO: 82), (b) GGSKNL (SEQ ID NO: 83), and (c) GA(T or N)(N or K)(A or T)I (SEQ ID NO: 84).
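
One way to picture the percent identity criterion at positions 14-19 is the following Python sketch. It is a simplified, ungapped, position-by-position comparison and is not the alignment software used for this document; the function name motif_identity and the 34-residue example segment are assumptions introduced only for illustration. Each motif position is written as the set of residues allowed at that position, so motif (a) above, GG(A or T)Q(A or T)L, becomes the list shown.

```python
# Motif (a): GG(A or T)Q(A or T)L, one entry per position,
# each entry listing the residues accepted at that position.
MOTIF_A = ["G", "G", "AT", "Q", "AT", "L"]

# Illustrative 34-residue segment (assumption: example input only,
# not a sequence taken from this document).
EXAMPLE_SEGMENT = "LTPEQVVAIASNGGGKQALETVQRLLPVLCQAHG"


def motif_identity(segment: str, motif: list[str], start: int = 14) -> float:
    """Percent of motif positions matched by the segment window that
    begins at the 1-based position `start` (ungapped comparison)."""
    window = segment[start - 1 : start - 1 + len(motif)]
    if len(window) < len(motif):
        raise ValueError("segment is too short for the requested window")
    matches = sum(res in allowed for res, allowed in zip(window, motif))
    return 100.0 * matches / len(motif)


if __name__ == "__main__":
    score = motif_identity(EXAMPLE_SEGMENT, MOTIF_A)
    print(f"{score:.1f}% identity at positions 14-19")
```

With a six-residue window, a single mismatch still leaves five of six positions matching, which is why thresholds such as 80% tolerate one substitution in this simplified scheme.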

In additional embodiments, non-naturally occurring proteins (e.g., fusion proteins) of the invention, as well as individual TAL repeats of the invention, comprise at amino acid positions 14-23 of at least one TAL repeat a sequence at least 80%, 85%, 90%, or 95% identical to GGAQALX₁X₂VLL (SEQ ID NO: 85), where X₁ and X₂ are independently any of the twenty commonly occurring amino acids found in proteins. In some instances, X₁ and X₂ are not E or G. In additional embodiments, non-naturally occurring proteins of the invention comprise at amino acid positions 14-19 of at least one TAL repeat a sequence at least 80% identical to GGAQAL (SEQ ID NO: 86).

As described elsewhere herein, the invention includes non-naturally occurring proteins which are fusion proteins. In many instances, these non-naturally occurring fusion proteins comprise a sequence specific nucleic acid binding activity and at least a second activity other than sequence specific nucleic acid binding activity. In certain embodiments, the second activity may be one of the following: an activator activity (e.g., a transcriptional activation activity), a repressor activity (e.g., a transcriptional repression activity), a nuclease activity, a topoisomerase activity, a gyrase activity, a ligase activity, a glycosylase activity, an acetylase activity, a deacetylase activity, an integrase activity, a transposase activity, a methylase activity, a demethylase activity, a methyl-transferase activity, a kinase activity, a recombinase activity, a phosphatase activity, a sulphurilase activity, a polymerase activity, or a fluorescent activity.

In some instances, the second activity is a nuclease activity. Further, such nuclease may comprise a FokI nuclease cleavage domain, a FokI nuclease cleavage domain mutant KKR Sharkey, or a FokI nuclease cleavage domain mutant ELD Sharkey. Further, the second activity may be conferred, for example, by a VP16, VP32 or VP64 transcriptional activator domain(s) or a KRAB transcriptional repressor domain.

The invention also comprises nucleic acid molecules (e.g., vectors) which encode proteins described herein, as well as host cells comprising such nucleic acid molecules.

The invention further comprises methods of regulating expression of a target gene. In some instances, methods of the invention comprise contacting a cell with a nucleic acid molecule which encodes a non-naturally occurring fusion protein described herein under conditions which allow for intracellular expression of the non-naturally occurring fusion protein.

Alignments of protein sequences were carried out and consensus sequences were generated using VectorNTI Advance, version 11.5.1, using the default settings (Life Technologies, Carlsbad, Calif.). The software scores amino acids in terms of identity and in terms of similarity. Similarity is defined as set forth in TABLE 1. A “strong” designation depicts a strong similarity while a “weak” designation depicts a weak similarity. Those designated as “strong” are depicted in the figures in the same manner as identical amino acids.

TABLE 1
Amino Acid | Strong | Weak | Amino Acid | Strong | Weak
A | GS | CTV | M | ILV | F
C | | AS | N | Q | DEGHKRST
D | E | GHKNQRS | P | | ST
E | D | HKNQRS | Q | N | DEHKRS
F | WY | HILM | R | K | DEHNQ
G | A | DNS | S | AT | CDEGKNPQ
H | Y | DEFKNQR | T | S | AKNPV
I | LMV | F | V | ILM | AT
K | R | DEHNQST | W | FY |
L | IMV | F | Y | FHW |

The invention also includes methods for genomic engineering and site specific integration of a nucleic acid molecule of interest and various assay formats and surrogate reporter systems to evaluate TAL effector activity. Also, the invention provides for methods to enrich, select or isolate cells that have been modified by a TAL effector such as, e.g., a TAL effector nuclease.

Furthermore, the invention comprises methods to fine-tune the activity of TAL effector proteins in a target host cell.

The invention may be more fully understood by reference to the following drawings.

DESCRIPTION OF THE FIGURES

FIGS. 1A-1B and FIGS. 2A-2B show a work flow for a modular web-based platform for providing TAL-specific services to customers. FIG. 1A shows one example of how a customer portal can be organized to allow for automatic processing of orders related to TAL effectors and related services. After login to the portal, the customer enters a specific nucleic acid sequence to be targeted by a requested TAL effector molecule and provides or selects additional product specifications. In addition, the customer can select optional services from a menu or enter inquiries or additional comments. Product specifications and requested services are then processed via an internal system based on design and assembly programs and databases to identify optimal assembly criteria. TAL effector molecules are then assembled based on de novo synthesized nucleic acid molecules and/or available libraries or parts. FIG. 1A discloses SEQ ID NO: 115.

FIG. 1B shows an example of a possible modular organization of such a web-based platform that includes workflows directed to TAL services. The platform consists at least of (i) a web interface (module 1) for input and storage of customer- and project-specific information and for information exchange between customer and service provider; (ii) a Design Engine (module 2) which integrates software and database information to determine TAL design and generate an assembly strategy; and (iii) a manufacture unit (module 3) comprising means and material to synthesize, assemble, express and analyse TAL constructs, wherein at least some steps of the workflow may be supported by a Laboratory Information Management System (LIMS).

FIG. 2A is a schematic drawing of the modular structure of a representative naturally occurring TAL protein. This protein is composed of an amino terminal end (N), a central array comprising a variable number of 34-amino acid repeats indicated by ovals with hypervariable residues at positions 12 and 13 that determine base preference, and a carboxyl terminal end (C) comprising a nuclear localization signal (NLS) and a transcription activator (AD) domain. FIG. 2A discloses SEQ ID NO: 90.

FIG. 2B shows an amino acid alignment of TAL nucleic acid binding cassettes from five different proteins. Two repeats are from the AvrXa27 protein of Xanthomonas oryzae (GenBank: AAY54168.1) (SEQ ID NOS 116-117, respectively, in order of appearance), two repeats are from the Hax3 protein of Xanthomonas campestris (GenBank: AAY43359.1) (SEQ ID NOS 71 and 118, respectively, in order of appearance), three repeats are from the PSIO7 protein of Ralstonia solanacearum (YP 003750492.1) (SEQ ID NOS 119-121, respectively, in order of appearance), two repeats are from the Tal5a protein of Xanthomonas oryzae (GenBank: AEQ96609.1) (SEQ ID NOS 122-123, respectively, in order of appearance), and three repeats are from the Tal11a protein of Xanthomonas oryzae (GenBank: AEQ98467.1) (SEQ ID NOS 124-126, respectively, in order of appearance). Identical amino acids are shown as white letters on black background and non-identical amino acids are shown as black letters on white background.

FIG. 3A shows the amino acid sequence of the wild-type Hax3 protein (960 amino acids total) of Xanthomonas campestris (GenBank AAY43359.1) (SEQ ID NO: 127). The 288 amino acid region labeled “1” is the amino flanking region of the TAL repeat region, which is labeled “2”. The 291 amino acid carboxyl terminal flanking region of the TAL repeat region is labeled “3”. The first TAL repeat of the repeat region is shown boxed. Further, amino acids 153, 289, and 683 are labeled for reference.

FIG. 3B shows a TAL effector construct (SEQ ID NO: 128) designed to bind to the following nucleotide sequence: TACACGTTTCGTGTTCGGA (SEQ ID NO: 87). The labels are as follows: (1) V5 epitope, (2) nuclear localization signal, (3) amino flanking region of the TAL repeat region, (4) TAL repeat region, (5) carboxyl terminal flanking region of the TAL repeat region, (6) two amino acid linker, and (7) a wild-type FokI nuclease coding sequence. The first TAL repeat of the repeat region is shown boxed.
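The relationship between a target sequence such as SEQ ID NO: 87 and the order of repeats can be illustrated with a minimal Python sketch. The RVD-to-base table below (NI for A, HD for C, NN for G, NG for T) is the commonly cited preference set and is an assumption here, since this figure description does not list the RVDs used in the construct. The function name `rvds_for_target` is hypothetical. The sketch assigns one RVD to every base, including the leading T, although the description does not state whether that position is covered by a repeat.

```python
# Commonly cited RVD-to-base preferences (assumed; not taken from FIG. 3B).
RVD_FOR_BASE = {"A": "NI", "C": "HD", "G": "NN", "T": "NG"}


def rvds_for_target(target: str) -> list[str]:
    """Return one RVD per nucleotide of the target, in 5'->3' order."""
    target = target.upper()
    unknown = set(target) - set(RVD_FOR_BASE)
    if unknown:
        raise ValueError(f"unsupported bases: {sorted(unknown)}")
    return [RVD_FOR_BASE[base] for base in target]


# Target sequence quoted in the FIG. 3B description (SEQ ID NO: 87).
TARGET = "TACACGTTTCGTGTTCGGA"
rvds = rvds_for_target(TARGET)
print(len(rvds), "-".join(rvds))
```

The linear order of the returned list mirrors the linear order of nucleotides in the target, which is the correspondence the repeat region is built around.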

FIGS. 4A to 4F show examples of TAL-binding assays suitable for evaluation of TAL binding specificity in vitro. FIG. 4A is a plate assay where TAL effectors translated in a cell-free environment are bound to DNA probes containing a TAL target site on Nickel-coated plates. An intercalating agent is added to allow for detection of specifically bound DNA molecules. Fluorescence read-out allows for comparison of binding specificity of different TAL-DNA interactions.

FIG. 4C is an alternative binding assay using paramagnetic Nickel-beads. Equal amounts of TAL protein were covalently coupled to activated DYNABEADS® and incubated with a 5-fold molar excess (Sample 1) and equimolar amount (Sample 2) of plasmid DNA for 1 hour. Beads were then separated from supernatant, washed once with PBS with Mg2+ and resuspended in equal volumes H2O. Resuspended beads were further diluted and total DNA was quantified by qPCR with plasmid specific primers and SYBR® Green against known dilutions of plasmid DNA (Std. 1 ng-Std. 0.001 ng). Samples were set up and quantified in triplicates.
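Quantification against known dilutions, as described for FIG. 4C, is typically done by fitting a standard curve of Ct against log10 of the input amount and reading sample amounts off that line. The following Python sketch shows one way this could be computed. The Ct values are invented for illustration, and the function names `fit_standard_curve` and `quantify` are hypothetical; only the standard range (1 ng to 0.001 ng) and the use of triplicates come from the description.

```python
import math

# Plasmid DNA standards in ng (range taken from the FIG. 4C description).
STANDARDS_NG = [1.0, 0.1, 0.01, 0.001]
# Invented mean Ct values for those standards (illustration only).
STANDARD_CT = [15.0, 18.3, 21.6, 24.9]


def fit_standard_curve(quantities, cts):
    """Least-squares fit of Ct = slope * log10(quantity) + intercept."""
    xs = [math.log10(q) for q in quantities]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(cts) / len(cts)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def quantify(ct_replicates, slope, intercept):
    """Average replicate Ct values and convert to ng via the standard curve."""
    mean_ct = sum(ct_replicates) / len(ct_replicates)
    return 10 ** ((mean_ct - intercept) / slope)


slope, intercept = fit_standard_curve(STANDARDS_NG, STANDARD_CT)
sample_ng = quantify([19.9, 20.0, 20.1], slope, intercept)  # invented triplicate
print(f"slope={slope:.2f} intercept={intercept:.2f} sample={sample_ng:.4f} ng")
```

Comparing the amounts recovered for Sample 1 and Sample 2 then indicates how much plasmid DNA remained bound to the TAL-coupled beads at each input ratio.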

FIG. 4E is a gel shift assay where target-site containing DNA is shifted in the presence of sufficient amounts of a specifically binding TAL protein.

FIG. 4F shows an example of a TAL effector binding assay where TAL binding to a predicted binding site correlates with an increase in fluorescent signals. A customized TAL effector protein translated in vitro (e.g., from a plasmid or PCR fragment containing a promoter region) is incubated with a set of oligonucleotides. A first oligonucleotide (sense) (SEQ ID NO: 129) carries a TAL binding site and terminal ends that can form a stem-loop structure. One end of said oligonucleotide is attached to a fluorophore (star symbol) whereas the other end is attached to a quencher molecule (sphere symbol) so that the fluorophore signal is quenched in the stem-loop conformation. The second oligonucleotide (antisense) (SEQ ID NO: 130) is designed to hybridize with the first oligonucleotide (where labeling is optional). At least a portion of the first and second oligonucleotides in the pool are present in an annealed open state where the fluorophore signal is not quenched. Binding of a specific TAL effector protein to the TAL binding site stabilizes the annealed open conformation, thereby shifting the equilibrium, which results in a signal increase. In contrast, a signal increase cannot be measured in the absence of TAL effector binding. FIG. 4D discloses SEQ ID NOS 129-130, 129-130 and 129-130, respectively, in order of appearance.

FIG. 5 shows a method and assay to identify truncated TAL effector variants with binding activity. A series of DNA fragments encoding truncated versions of TAL N- and C-termini are generated. Each 5′ DNA fragment (A-fragment) contains a nucleic acid sequence encoding a truncated TAL N-terminus and the 5′ moiety of the central repeat domain whereas each 3′ fragment (B-fragment) contains a nucleic acid sequence encoding a truncated TAL C-terminus and the 3′ moiety of the central repeat domain. The resulting length variants are equipped with terminal restriction sites and are combined in a cleavage-ligation reaction to obtain all possible combinations of A- and B-fragments. The obtained length variants are inserted into a linearized target vector downstream of a promoter allowing for expression of the variants in a host cell and upstream of a fusion domain (e.g., an activator domain) to obtain a TAL effector fusion protein. Furthermore, the target vector carries one or more specific TAL binding sites for the TAL effector variants associated with a reporter gene operatively linked with a promoter region. The resulting truncation library is then inserted into host cells allowing for expression of the various TAL effector truncations. Only the functional truncated variants will bind to the TAL binding site(s) in the vector and induce detectable reporter gene expression, allowing identification of those cells carrying a functional truncated TAL effector. Selected host cells may be replicated and nucleic acid sequences of identified candidates can be obtained via sequencing of the 5′ and 3′ ends using specific primer binding sites in the vector flanking the TAL effector insertion site.
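The size of the truncation library follows from combining every A-fragment with every B-fragment. A minimal Python sketch of that enumeration is given below. The fragment labels and the number of truncations are hypothetical placeholders, since the description does not specify which truncations are generated.

```python
from itertools import product

# Hypothetical labels for truncated N-terminal (A) and C-terminal (B) fragments.
A_FRAGMENTS = ["N-full", "N-trunc1", "N-trunc2"]
B_FRAGMENTS = ["C-full", "C-trunc1", "C-trunc2", "C-trunc3"]


def truncation_library(a_fragments, b_fragments):
    """Return every A/B pairing obtainable in the cleavage-ligation reaction."""
    return list(product(a_fragments, b_fragments))


library = truncation_library(A_FRAGMENTS, B_FRAGMENTS)
print(len(library), "variants, e.g.", library[0], library[-1])
```

With three A-fragments and four B-fragments the library holds twelve length variants, each of which would be screened through the reporter read-out described above.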

FIGS. 6A and 6B show various effector fusion open reading frames sequence-optimized for expression in mammalian hosts: SEQ ID NO: 1: codon-optimized sequence encoding FokI nuclease cleavage domain; SEQ ID NO: 2: codon-optimized sequence encoding FokI nuclease cleavage domain mutant KKR Sharkey; SEQ ID NO: 3: codon-optimized sequence encoding FokI nuclease cleavage domain mutant ELD Sharkey; SEQ ID NO: 4: codon-optimized sequence encoding VP16 activator; SEQ ID NO: 5: codon-optimized sequence encoding VP64 activator composed of a tetramer sequence representing the VP16 core motif; SEQ ID NO: 6: codon-optimized sequence encoding KRAB repressor.

FIG. 7A shows one example of how a trimer repeat library can be assembled from repeat monomers. Selected building blocks from a monomer library with at least one repeat for each base and variants thereof providing individual overhangs upon type IIS cleavage are assembled in random combinations into a capture vector containing a counter-selectable marker. Cleavage sites are indicated by numbers. Equal numbers represent resulting compatible overhangs.

FIG. 7B illustrates the construction of a TAL effector fusion by a two-step assembly process. Two sets of 4 trimer building blocks arranged in capture vectors are assembled into a target vector via compatible type IIS restriction enzyme cleavage sites, thereby replacing a negative selection marker gene. The target vector may already contain the flanking TAL effector N- and C-terminal ends and an effector fusion sequence or a multiple cloning site (MCS). FIG. 7B discloses SEQ ID NOS 131-132, respectively, in order of appearance.

FIG. 7C shows target vectors for the construction of various TAL effector fusions. Examples for different effector functions are provided (VP16, activator; KRAB, repressor; FokI R and L, dimerizing nuclease cleavage domains); TAL effector nucleases active as dimers can be provided in a vector pair with each vector comprising the sequence for a nuclease monomer. Some of the vectors contain truncated versions of N- or C-terminal TAL ends described in more detail elsewhere herein.

FIGS. 8A to 8C show vector maps of different TAL Gateway entry vectors. Vector features: NLS, nuclear localization signal; Kan(R), kanamycin resistance gene; attL1 and attL2, recombination sites (allow recombinational cloning of the gene of interest from an entry clone (Landy, A. Dynamic, structural, and regulatory aspects of lambda site-specific recombination. Ann. Rev. Biochem. 1989; 58:913-49. Review)); pENTR-funct-vec-for and pENTR-funct-vec-rev, primer binding sites; rrnB T1 and T2 transcription terminators (protect the cloned gene from expression by vector-encoded promoters, thereby reducing possible toxicity (Orosz et al. Analysis of the complex transcription termination region of the Escherichia coli rrnB gene. Eur. J. Biochem. 201(3):653-9 (1991))); T7 promoter/priming site (allows in vitro transcription in the sense orientation and sequencing through the insert); M13 Forward (−20) priming site (allows sequencing in the sense orientation); V5 epitope Gly-Lys-Pro-Ile-Pro-Asn-Pro-Leu-Leu-Gly-Leu-Asp-Ser-Thr (SEQ ID NO: 88) (allows detection of the recombinant fusion protein by the Anti-V5 antibodies (Southern et al. Identification of an epitope on the P and V proteins of simian virus 5 that distinguishes between two isolates with different biological characteristics. J Gen Virol. 72 (Pt 7):1551-7 (1991))); pUC origin (allows high-copy number replication and growth in E. coli); Gly-Ser linker (flexible peptide linker to prevent steric hindrance between domains). Unique restriction enzyme cleavage sites are indicated for each vector. FIG. 8A discloses SEQ ID NOS 133-134, respectively, in order of appearance. FIG. 8B shows the vector map for pENTR221 TATLTATL vp16 activator. FIG. 8C shows the vector map for pENTR221 Truncated TAL MCS.

FIG. 9A shows an alternative method of assembling TAL effector encoding nucleic acid molecules by solid phase elongation based on a TAL trimer library. FIG. 9A shows the different modules required for solid-phase assembly from trimer building blocks: 16 starter modules comprising a TAL 5′ flanking region attached to an anchor and 16 different trimer modules representing all triplet combinations starting with a T-binding cassette, 64 elongation modules representing all triplet combinations of A-, G-, C- and T-binding cassettes, and 64 completion modules representing all triplet combinations fused to a TAL 3′ flanking region.

FIGS. 9B to 9G illustrate one possible method of assembling TAL effector sequences on solid-phase. Following immobilization of a starter module on a solid phase via a molecular anchor (FIG. 9B) the module is cleaved with a first type IIS enzyme (BsmBI in this example) to generate a single-strand overhang at the 3′ end. A first elongation module is selected from the library and cleaved with a second type IIS enzyme (BbsI in this example) to generate an overhang at the 5′ end. Cleaved starter module and first elongation module are then mixed and ligated on the solid phase (FIG. 9C). After a washing step an overhang is generated at the 3′ end of the ligation product using the first type IIS enzyme, and a second elongation module comprising a compatible 5′ overhang is added and ligated (FIG. 9D). Digestion and ligation cycles are repeated until a chain of n−1 modules has been assembled (FIG. 9E). The last step performed on the solid phase adds a completion module which provides the 3′ flanking TAL sequences (FIG. 9F). Finally, the TAL effector sequence is released from the solid phase by cleavage with a third type IIS enzyme (AarI in this example) which generates the compatible ends for insertion in a capture or functional vector (FIG. 9G).
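The module counts in FIG. 9A and the order of additions in FIGS. 9B to 9G can be reproduced with a short Python sketch. It assumes, for illustration only, that a target is given as a string of bases whose length is a multiple of three and that the first triplet must begin with T, matching the starter modules described above. The function name `plan_modules` and the example target are hypothetical; enzyme digestion and ligation are not modelled.

```python
from itertools import product

BASES = "AGCT"
TRIPLETS = ["".join(p) for p in product(BASES, repeat=3)]

# Module libraries implied by the FIG. 9A description.
STARTER = [t for t in TRIPLETS if t.startswith("T")]  # 16 modules
ELONGATION = list(TRIPLETS)                            # 64 modules
COMPLETION = list(TRIPLETS)                            # 64 modules


def plan_modules(target: str) -> list[tuple[str, str]]:
    """Split a target into triplets and assign each to a module type."""
    target = target.upper()
    if len(target) < 6 or len(target) % 3 != 0:
        raise ValueError("target length must be a multiple of 3 and at least 6")
    chunks = [target[i:i + 3] for i in range(0, len(target), 3)]
    if chunks[0] not in STARTER:
        raise ValueError("first triplet must start with T")
    if any(chunk not in ELONGATION for chunk in chunks):
        raise ValueError("target may only contain A, G, C and T")
    plan = [("starter", chunks[0])]
    plan += [("elongation", chunk) for chunk in chunks[1:-1]]
    plan.append(("completion", chunks[-1]))
    return plan


print(len(STARTER), len(ELONGATION), len(COMPLETION))
print(plan_modules("TACGGATCC"))  # hypothetical 9-base target
```

For a chain of n modules the plan always contains one starter, n−2 elongation modules and one completion module, which corresponds to the repeated digestion and ligation cycles followed by the final completion step.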

FIG. 10A shows (1) three types of amino acid coding sequences, labeled “N-Terminus”, “Nucleic Acid Binding Repeat”, and “C-Terminus”, and (2) a topoisomerase-adapted linear vector (lower left) and a closed circular vector. The boxes with diamond shapes in them represent recombination sites (e.g., GATEWAY™ sites) with different recombination specificities indicated by “Rec. No.” Covalently bound topoisomerase proteins are represented at the termini of the linear vector by the closed circle connected to the vector by a solid line. “ORI” refers to an origin of replication, “Pos Sel” refers to a positive selectable marker, and “Neg Sel” refers to a negative selectable marker.

FIG. 10B shows a linear vector containing topoisomerase covalently bound at both ends. Also shown are two types of inserts (labeled “Insert 1” (SEQ ID NOS 135-136, respectively, in order of appearance) and “Insert 2” (SEQ ID NOS 135 and 137, respectively, in order of appearance)), each with sequence identity at one end with one terminus of the vector.

FIG. 10C shows a GATEWAY™ recombination reaction series in which two nucleic acid segments (labeled “DNA-A” and “DNA-B”) are introduced into a circular nucleic acid molecule while a ccdB (or alternatively tse2) gene is excised from the circular nucleic acid molecule. The attB, attP, attL, and attR sites are each identified by single letter identifiers (e.g., B, P, L, and R). The numbers (i.e., 1, 2, and 3) following the letter att site type identifiers refer to recombination site specificities.

FIGS. 10D to 10F show three exemplary vector formats which may be used for assembly of nucleic acids encoding TAL effectors and expression of TAL effector proteins and TAL fusions. FIG. 10D shows a topoisomerase-adapted vector format of a type which can be used for insertion of one or more nucleic acid segments, resulting in the generation of a circularized molecule. FIGS. 10E and 10F show two vector formats which contain recombination sites and can be used for the insertion of nucleic acid segments at either two (FIG. 10E) or one (FIG. 10F) location. Labels are as follows: “Rec#”, recombination sites with specificities which vary with the number; “SM”, selectable marker (e.g., positive or negative selectable markers); “ori”, origin of replication; “Tag”, a tag sequence (e.g., an affinity tag); “Coding Sequence”, an amino acid coding sequence formatted so as to result in the generation of a fusion protein when another coding sequence located between Rec1 and Rec2 is transcribed and translated; and the arrows represent promoters.

FIG. 11A shows an example of a high throughput DNA assembly kit based on type IIS and topoisomerase mediated cloning. In this instance, a series of topoisomerase-adapted vectors containing symmetrical ends are designed that differ only in terminal type IIS restriction sites. Following analysis of an input sequence, a design software tool selects a specific topoisomerase-adapted vector that is compatible with the nucleic acid sequence to be assembled and generates subfragments of the sequence. The subfragments obtained, e.g., by PCR, are cloned into the selected topoisomerase-adapted vector in unspecific orientation. The resulting vector library is then combined with a target vector in the presence of an enzyme mix containing at least one type IIS restriction enzyme and a ligase, and the subfragments are assembled into the target vector in directed orientation due to their compatible ends. FIG. 11A discloses SEQ ID NO: 138.

FIG. 11B illustrates two different exemplary workflows integrating a seamless cloning kit or the underlying assembly method in services offered by a service provider. In the left workflow an assembly strategy is developed by a service provider based on customer sequence information and the customer is provided with an individual toolkit comprising a selection of vectors (e.g., topoisomerase-adapted and target vectors), a ready-to-use enzyme mix, and competent cells together with an assembly protocol. The customer can use the kit to assemble available DNA fragments. The workflow on the right can be applied, e.g., where no DNA template is available or the customer requests an optimized or modified sequence. In this case, the service provider can integrate the vectors and assembly strategy illustrated in FIG. 7A into the internal manufacturing process to assemble de novo synthesized DNA fragments. The full-length synthetic gene is subjected to a quality control (QC) process and shipped to the customer.

FIG. 12A shows one example of an assembly method for generating a random arrangement of TAL nucleic acid binding cassettes. Bases bound by each of the various cassettes are indicated by the letters A, T, C, and G. Bases are indicated only for the top of the three Repeat Library members shown. FIG. 12A discloses SEQ ID NOS 139-140, respectively, in order of appearance.

FIG. 12B shows four partial TAL nucleic acid binding cassette coding sequences and encoded amino acids. Dashes represent omitted sequence data. NcoI and Esp3I cut sites are shown in boxes in the uppermost nucleotide sequence. The two-codon coding sequences encoding the amino acids that determine base recognition in each of the four TAL cassettes are shown in boxes. Also, bases recognized by each TAL cassette are shown above the codon coding sequences. FIG. 12B discloses SEQ ID NOS 141, 143, 145, 147, 142, 144, 146, 148, 141, 149, 145, 147, 142, 150, 146, 148, 141, 151, 145, 147, 142, 152, 146, 148, 141, 153, 145, 147, 142, 154, 146 and 148, respectively, in order of appearance.

FIGS. 13A to 13C show an example of a reporter assay suitable to demonstrate TAL function in E. coli. FIG. 13A shows a genetic inverter system that was created to test whether a TAL effector binds DNA in E. coli, in which the induction of AvrBs3 constructs by addition of arabinose is predicted to inhibit GFP expression (Des. GFP, destabilized GFP; Alt. TAL-trun, alternate truncations). FIG. 13B indicates that three AvrBs3 C-terminal truncation constructs expressed as fusions to thioredoxin show proper molecular weight in an SDS-PAGE analysis. FIG. 13C indicates that reporter strains expressing AvrBs3 constructs show significantly decreased fluorescence relative to control strains (pTrc-UPA, promoter with a UPA20 target sequence).

FIGS. 14A to 14C illustrate an example of TAL responsiveness in algae. A TAL genetic circuit for microalgae was constructed by placing a 3×TAL binding site in front of a minimal promoter driving expression of a luciferase reporter gene. A TAL effector was fused in frame to the N-terminus of a hygromycin resistance gene (FIG. 14A). The genetic circuit with a Hsp70A-Rubisco promoter was used as a positive control and a circuit without TAL effector was used as a negative control. The constructs were transformed into algae, followed by Hygromycin B selection. The selected colonies were assayed for luciferase expression (FIG. 14B) and TAL expression by Western blot analysis (FIG. 14C).

FIG. 15A shows FACS analyses of cells stably transfected with green fluorescent protein (GFP) reporter constructs. Two cell lines were used, each with a stably integrated single copy of a TAL response cassette, wherein GFP reporter expression is driven by either an adenovirus E1b minimal promoter or a CMV promoter. The indicated plasmids were co-transfected with a red fluorescent protein (RFP) expression plasmid as transfection control into the TAL responsive cell lines. RFP positive cells were gated and analyzed by flow cytometry. Reporter gene expression was activated by TAL effectors and TAL-VP16 fusions but not by empty vector or an irrelevant activator (GAL-VP16) (left). Conversely, GFP protein expression was repressed by a TAL-KRAB repressor but not by vector control or an irrelevant Tet repressor (TetR) in the CMV-GFP reporter cell line (right).

FIG. 15B illustrates how synthetic TAL effectors activate reporter gene expression. The repeat domain of wild-type TAL effector AvrBs3 was replaced with repeats designed to target the 13 base pairs of the GAL4 DNA binding sequence. A Gal4 responsive reporter construct, which has 6 copies of GAL4 DNA binding sequences upstream and an adenovirus E1b minimal promoter, was used to demonstrate predicted binding of an engineered TAL effector to its corresponding DNA binding sequence and subsequent activation of gene expression. FIG. 15B discloses SEQ ID NO: 155.

FIG. 15C shows an example of TAL-mediated activation in a dual luciferase assay. The endogenous NLS and activation domain of TAL effectors were replaced with NLS and VP16 or VP64 activation domains to create fusion activators TAL-VP16 and TAL-VP64. The reporter construct expresses luciferase from a CMV mini promoter harboring three copies of corresponding TAL DNA binding sequences. 293FT cells were co-transfected with indicated plasmids along with a Renilla luciferase expression plasmid. FIG. 15D shows a dual luciferase assay that was performed 48 h post-transfection.
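In a dual luciferase read-out the reporter signal is usually divided by the co-transfected Renilla signal to correct for transfection efficiency, and activation is then expressed relative to a control transfection. The Python sketch below illustrates that arithmetic. It assumes the reporter is a firefly luciferase, which the description does not state, and all numeric readings and function names are hypothetical.

```python
def normalized_activity(reporter_signal: float, renilla_signal: float) -> float:
    """Reporter luminescence corrected by the Renilla transfection control."""
    if renilla_signal <= 0:
        raise ValueError("Renilla signal must be positive")
    return reporter_signal / renilla_signal


def fold_activation(sample: tuple[float, float], control: tuple[float, float]) -> float:
    """Normalized sample activity relative to normalized control activity."""
    return normalized_activity(*sample) / normalized_activity(*control)


# Invented (reporter, Renilla) luminescence readings for illustration.
EMPTY_VECTOR = (1000.0, 5000.0)
TAL_VP16 = (40000.0, 4000.0)

fold = fold_activation(TAL_VP16, EMPTY_VECTOR)
print(f"fold activation over empty vector: {fold:.1f}")
```

Normalizing each well before taking the ratio keeps differences in transfection efficiency from being read as differences in activator strength.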

FIGS. 16A to 16C show TAL-mediated repression in a reporter assay. In FIG. 16A a reporter construct harboring a Tet-responsive binding site was used as negative control to demonstrate TAL specificity. The TAL repressor was constructed by replacing the C-terminal activation domain of AvrBs3 with a KRAB domain. The reporter constructs express GFP or LacZ from a full-length CMV promoter harboring a TAL DNA binding sequence or a TET binding sequence as a control. Cell cultures co-transfected with the indicated combinations of plasmids were analyzed under a microscope for GFP expression (left). The repression activity of TAL-KRAB was demonstrated to be target-site specific. Furthermore, β-galactosidase reporter activity was measured 72 hours post-transfection. The figure is graphed as the percentage of the signal relative to the pcDNA3 control transfection. Repression by co-expressed AvrBs3-KRAB was more efficient in the presence of two copies of TAL DNA binding sites and comparable with Tet-mediated repression.

FIG. 16B shows an example for the establishment of TAL responsive stable cell lines. A single copy of the TAL responsive reporter cassette was integrated into the genome of FLP-IN™-293 cells. The GFP reporter is driven from a full-length CMV promoter harboring a TAL DNA binding sequence. Parental stable FLP-In cells and cells with the CMV-GFP cassette were analyzed by flow cytometry.

FIGS. 16C to 16J show down regulation of chromosomal genes by engineered TALs.

A TAL-KRAB repressor was co-transfected with an RFP expression plasmid as a transfection control into a FLP-IN™ stable cell line harboring a CMV-1×TAL-GFP reporter cassette. pcDNA3 empty vector and Tet repressor were used as negative controls. siRNA targeting GFP and control siRNA were co-transfected with the RFP expression plasmid. The cell population gated on RFP positive cells was analyzed by flow cytometry 72 hours post-transfection.

FIGS. 17A and 17B show a TAL effector GFP genomic cleavage assay designed to quantitatively assess the ability of a custom TAL nuclease pair to cleave a specific genomic DNA target. Spacers of different lengths were inserted into a GFP reporter gene to shift the open reading frame such that a non-functional protein is expressed. Reporter constructs were stably integrated in 293FT cells. The cells were transfected with TAL ArtX1-FokKK and ArtX2-FokEL nuclease pairs with TAL repeats directed to specific target sites flanking the spacers in the GFP open reading frame. Following nuclease cleavage within the spacer region, the DNA break is repaired by the endogenous non-homologous end joining pathway, leading to partial restoration of GFP expression and a respective shift in green cells.
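Why only part of the repaired population turns green can be illustrated with simple reading-frame arithmetic. The Python sketch below assumes, as a simplification not stated in the description, that the GFP open reading frame would be in frame without the spacer and that repair introduces a net insertion or deletion confined to the spacer. The spacer length of 16 and the indel range are hypothetical.

```python
def frame_restored(spacer_len: int, indel: int) -> bool:
    """True if spacer length plus net indel is a multiple of three.

    indel is the net change in length after repair: negative for
    deletions, positive for insertions.
    """
    remaining = spacer_len + indel
    if remaining < 0:
        raise ValueError("deletion larger than the spacer is not modelled")
    return remaining % 3 == 0


SPACER_LEN = 16  # hypothetical spacer that shifts the frame (16 % 3 == 1)
restoring = [d for d in range(-10, 11) if frame_restored(SPACER_LEN, d)]
print("net indels that restore the frame:", restoring)
```

Only roughly one in three net indel sizes brings the reporter back into frame, which is consistent with a partial rather than complete restoration of GFP expression.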

FIG. 18A shows a TAL effector mammalian transient activation assay designed to assess the ability of a custom TAL activator to bind and stimulate transcription at a target site in tissue culture cells. PGLOW-TOPO® is a promoterless GFP vector that expresses very low or undetectable levels of GFP when introduced into tissue culture cells. A specific TAL binding site is fused in front of a minimal promoter (e.g., by PCR), and the resulting product is then topoisomerase cloned into the GFP vector. Co-transfection of this plasmid and the custom TAL activator leads to expression of GFP, which can be detected by various methods such as fluorescence microscopy or flow cytometry.

FIG. 18B shows a transient cleavage and repair assay designed to assess the ability of a custom TAL nuclease pair to bind and cleave a plasmid bearing a specific target site in tissue culture cells. Plasmid A is a cleavage target that fuses LacZ, two custom TAL nuclease binding sites separated by a spacer of approximately 16 bp, and a GFP fragment containing a 5′ truncation sufficient to render the expressed protein non-functional. Plasmids C and D express custom TAL nucleases 1 and 2 that bind to DNA sequences 1 and 2 on Plasmid A. The nuclease domains of each TAL nuclease then dimerize and generate a double strand DNA break on Plasmid A. Plasmid B consists of the 3′ end of LacZ fused in frame to GFP. Plasmid B does not contain a promoter and therefore does not express GFP. Generation of a double strand break between sequences 1 and 2 on Plasmid A stimulates recombination between homologous sequences on Plasmids A and B and leads to the expression of a functional LacZ-GFP fusion protein which can be detected by various methods such as fluorescence microscopy or flow cytometry.

FIG. 19 shows an exemplary assay that can be used to demonstrate TAL nuclease-mediated DNA cleavage. TAL nucleases were synthesized using a rabbit reticulocyte in vitro transcription/translation system. The DNA target sites with different length spacers were cloned into the pCRT7/CT vector using overlapping PCR fragments. The pairs of nucleases expressed from the in vitro transcription/translation system were incubated with the target plasmids or PCR amplicons spanning the target site. The digest products were resolved on an agarose gel to demonstrate successful cleavage. FIG. 19 discloses SEQ ID NOS 156-159, respectively, in order of appearance.

FIGS. 20A to 20H show an example of sequence mapping of TAL effector nuclease mediated genomic lesions. This approach takes advantage of mismatch-detecting enzymes (MME) such as that from Perkinsus marinus, Cel1, Res1 or similar to identify modifications in the genome. A) Starting with treated and untreated cell populations, the genomic DNA is purified (B) and cleaved with a cocktail of restriction enzymes that result in an average fragment size of 100 base pairs (C). This can also be achieved using mechanical and enzymatic shearing techniques. D) Those populations are mixed, melted and allowed to cross-hybridize, resulting in a mismatch at the point of the lesion where the strand from the treated cell anneals with that from the untreated cell. E) The fragments are then adapted using a modified Ion PGM ‘P1*’ adapter containing a very rare restriction site at its 5′ end. This restriction site would ideally be that of a rare-cutting type IIS restriction enzyme to be compatible with ION TORRENT™ sequencing primer design. F) After clean up, the mismatch (indicating the lesion to be identified) is cleaved by treatment with an MME. G) The population containing new non-adapted ends is then ligated with the ‘A’ adapter, which does not contain the rare site in ‘P1*’. H) The entire population is then treated with the rare cutting enzyme to release the ‘A’ adapter ligated to the modified ‘P1*’ adapter. This leaves a population of fragments appended with the ‘P1*’ adapter on each end (non-lesion) and fragments with the ‘P1*’ adapter on one end and ‘A’ ligated to the lesion site. This population would then be subjected to PGM sequencing using ‘P1*’ to anchor the fragments to the beads and ‘A’ to identify the genomic lesion sites.

FIG. 21 shows one embodiment of a single-site, TAL effector fusion mediated homologous recombination process. The upper, long, thin horizontal line represents cellular nucleic acid (e.g., a region of a cellular chromosome). The line below containing the white box and the white rectangle represents nucleic acid which is to be integrated into the cellular nucleic acid. The white oval represents a cellular promoter, and the white square represents an open reading frame which is to be integrated into the cellular nucleic acid. The black rectangle represents a cellular open reading frame which is normally operably connected to the promoter. “TAL” represents a TAL effector nuclease cleavage site. “PS” represents primer binding sites.

FIGS. 22A to 22E show different vectors suitable for co-expression of TAL effector pairs wherein TAL effector open reading frames (ORF) are under control of the same promoter and either separated by an IRES (FIG. 22A), a T2A cleavage site (FIG. 22B), a translational coupler sequence (FIG. 22C) or an intein (FIG. 22D), or wherein TAL effector ORFs are expressed from different expression cassettes on the same vector (FIG. 22E).

FIG. 23A shows a schematic of the use of TAL effector fusions for the assembly of a protein complex. In this figure, DNA with a series of TAL effector binding sites (labeled “TAL”) is connected to a solid support (labeled “Support”) via a linker segment (labeled “Linker”). The DNA segment contains spacers (labeled “Spacers”) between each TAL effector binding site. TAL effector fusions are shown interacting with the DNA. In addition to a TAL effector, these fusions contain an amino acid segment which connects the TAL effector to a fusion partner (labeled “Connector”). The DNA and the TAL effector fusions are designed in such a manner that, upon binding of the fusion to the DNA, the fusion partners form a protein complex.

FIG. 23B illustrates two possible pathways for producing 2,3-butanediol using enzymes from E. coli and B. subtilis. E. coli gene products: ilvI, acetolactate synthase large subunit; ilvH, acetolactate synthase isozyme III small subunit; pdh, pyruvate dehydrogenase; B. subtilis gene products: alsD, acetolactate decarboxylase; ydjL, acetoin reductase/2,3-butanediol dehydrogenase; acoABCL operon, encodes the E1α, E1β, E2, and E3 subunits of the acetoin dehydrogenase complex.

FIG. 23C shows an exemplary design of TAL DNA scaffold-assisted assembly of 2,3-butanediol pathways in microalgae. The genes encoding the different enzymatic activities required for 2,3-butanediol production are fused to different TAL effector sequences with specific binding sites (BS) on the DNA TAL scaffold. A flexible linker is inserted between the TAL effector and the enzymatic domain to allow for independent folding and accessibility of the fused enzymatic domains.

FIG. 24 shows an in vitro TAL nuclease cleavage assay and methods for preparing components used in the same. The left side of the figure diagrammatically shows the preparation of expression vectors encoding TAL effector nuclease fusions. The expression vectors labeled “TAL-Fwd” and “TAL-Rev” encode two domains of a FokI restriction endonuclease connected to TAL effectors which bind nucleic acid in a manner so as to bring the domains into the correct proximity for nuclease activity. These TAL effector nuclease fusion expression vectors are then transcribed and translated in vitro to produce two FokI-TAL effector fusions. The right side of the figure shows a pUC19 vector, primers and associated PCR reaction designed to generate amplification products which, upon hybridization, form a linear nucleic acid molecule with TAL effector fusion binding sites. These binding sites are positioned to bring TAL effector nuclease fusions together in a manner which results in cleavage of the nucleic acid between the TAL binding sites. The bottom center of the figure shows a gel of cleavage reaction mixtures.

FIG. 25A and FIG. 25B show an amino acid alignment between two coding region-derived protein sequences and consensus sequence regions at identical or strongly similar positions thereof; the proteins are RBRH_01844 of Burkholderia rhizoxinica HKI 454 (GenBank Accession No. YP_004022479) (SEQ ID NO: 48) and RBRH_01776 of the same species (GenBank Accession No. YP_004030669) (SEQ ID NO: 49). For the protein sequences, identical and strongly similar amino acids are shown as white text against a black background while weakly similar amino acids are shown as black text against a white background. The boxed consensus sequence regions at the amino and carboxyl termini of the full length amino acid sequences are regions that flank the TAL repeat region. The underlined regions in the consensus sequence near the amino and carboxyl termini may represent TAL repeats or TAL repeat-like sequences. TAL repeat-like sequences such as these could assist in the formation of the correct structure or nucleic acid binding activity of terminal TAL repeats or the TAL repeat region generally.

FIG. 26 shows the coding region-derived amino acid sequence of protein RBRH_01776 of Burkholderia rhizoxinica (SEQ ID NO: 47; GenBank Accession No. YP_004030669). The white text on a black background at the amino and carboxyl termini of the full length amino acid sequence depicts regions that flank the TAL repeat region. The underlined regions near the amino and carboxyl termini may represent TAL repeats or TAL repeat-like sequences. The individual TAL repeats are shown in alternating bold, italic text and plain text. The amino acid pairs shown in boxes are repeat variable diresidue (RVD) sequences within TAL repeats.

FIG. 27 shows an amino acid alignment of TAL repeat structures of the two Burkholderia proteins; those repeat structures designated Burkholderia 1-18 are from protein RBRH_01776 of Burkholderia rhizoxinica HKI 454 (GenBank Accession No. YP_004030669) (SEQ ID NOS 160-177, respectively, in order of appearance) and those repeat structures designated Burkholderia A-T are from protein RBRH_01844 of the same species (GenBank Accession No. YP_004022479) (SEQ ID NOS 178-197, respectively, in order of appearance). Identical or similar amino acids are shown as white letters on black background and non-identical amino acids are shown as black letters on white background. The sequences are assigned sequence identification numbers in FIG. 26. The symbols shown in the figure designate the repeat variable diresidue at amino acid positions 12 and 13.

FIG. 28 shows an amino acid alignment of TAL repeat structures from a protein of an unidentified marine organism (referred to as Marine Organism A, GenBank Accession No. EBN19409) (SEQ ID NOS 198-206, respectively, in order of appearance). Identical or similar amino acids are shown as white letters on black background and non-identical amino acids are shown as black letters on white background. The sequences are assigned sequence identification numbers in FIG. 27. The symbols shown in the figure designate the repeat variable diresidue at amino acid positions 12 and 13.

FIG. 29 shows an amino acid alignment of TAL repeat structures from a protein of an unidentified marine organism (referred to as Marine Organism B, GenBank Accession No. ECG96325) (SEQ ID NOS 207-212, respectively, in order of appearance). Identical or similar amino acids are shown as white letters on black background and non-identical amino acids are shown as black letters on white background. The sequences are assigned sequence identification numbers in FIG. 28. The symbols shown in the figure designate the repeat variable diresidue at amino acid positions 12 and 13.

FIG. 30 shows an amino acid alignment of TAL nucleic acid binding cassettes from seven different proteins. Two repeats are from the AvrXa27 protein of Xanthomonas oryzae (GenBank Accession No. AAY54168.1) (SEQ ID NOS 213-214, respectively, in order of appearance), two repeats are from the Hax3 protein of Xanthomonas campestris (GenBank Accession No. AAY43359.1) (SEQ ID NOS 221-222, respectively, in order of appearance), three repeats are from the PSIO7 protein of Ralstonia solanacearum (Accession No. YP_003750492.1) (SEQ ID NOS 223-225, respectively, in order of appearance), two repeats are from the Tal5a protein of Xanthomonas oryzae (Accession No. AEQ96609.1) (SEQ ID NOS 226-227, respectively, in order of appearance), three repeats are from the Tal11a protein of Xanthomonas oryzae (GenBank Accession No. AEQ98467.1) (SEQ ID NOS 228-230, respectively, in order of appearance), six repeats are from an unidentified blood-borne pathogen protein (BBP) (GenBank Accession No. CCA82456) (SEQ ID NOS 215-220, respectively, in order of appearance), and six repeats are from Marine Organism B from FIG. 28 (SEQ ID NOS 207-212, respectively, in order of appearance). Identical and similar amino acids are shown as white letters on black background and non-identical amino acids are shown as black letters on white background. The symbols shown in the figure designate the repeat variable diresidue at amino acid positions 12 and 13.

FIG. 31A and FIG. 31B show an amino acid alignment between the AvrXa10 protein of Xanthomonas oryzae (GenBank Accession No. AAA92974) (SEQ ID NO: 231) and a coding region-derived amino acid sequence of protein RBRH_01776 of Burkholderia rhizoxinica (GenBank Accession No. YP_004030669) (SEQ ID NO: 47). Identical and strongly similar amino acids are shown as white text against a black background. Weakly similar amino acids are shown as black text against a white background. Three sequences of roughly 26 amino acids each are also enclosed in open boxes as shown in FIG. 31A. These boxes are included to show exemplary amino acid sequences which are fairly highly conserved between the two proteins represented therein.

FIG. 32A and FIG. 32B show an exemplary kit and method for type IIS-mediated assembly of a TAL effector nuclease based on a dimer library of TAL cassette building blocks. FIG. 32A illustrates how a dimer library can be arranged to allow for co-assembly of 4 building blocks each into two capture vectors using a minimum set of building blocks. FIG. 32B shows how a universal TAL assembly kit relying on a collection of dimer building blocks may be arranged.

FIGS. 33A to 33C show the activation and repression of the endogenous sox2 gene in HeLa cells using TAL activator and TAL repressor proteins. FIG. 33A shows the promoter region of the endogenous sox2 gene with transcription factor- and TAL-binding sites indicated.

FIG. 33B shows the activation of the Sox2 promoter via targeting of TAL FLVP64 activator fusion proteins to binding sites 4643 and/or 655. FIG. 33C shows the repression of the Sox2 promoter via targeting of TAL KRAB repressor and TAL MCS fusion proteins to binding site 4643. HeLa cells were transfected with the indicated TAL effector expression plasmids, and the mRNA levels of sox2 were then evaluated by TaqMan assay 72 hours post transfection and normalized to β-actin.

FIGS. 34A and 34B show expression systems based on the Tse2/Tsi2 toxin/antidote effect for the enrichment of TAL nuclease-modified cells. FIG. 34A shows an embodiment wherein TAL nuclease expression vector(s) are co-delivered to a target host cell with a surrogate reporter containing a tse2 gene fused to a target binding site for the TAL nuclease and a tsi2 gene placed out of frame. Cells not expressing functional TAL nuclease pairs will be killed by Tse2 expression. However, in the presence of functional TAL nuclease, double strand breaks are introduced into the target cleavage site, a portion of which will be repaired by NHEJ, putting the tsi2 gene into the correct reading frame. The Tsi2 antidote protein will be released via T2A-mediated auto-cleavage and rescue cells from Tse2-associated toxicity. FIG. 34B shows an embodiment where a vector set for coexpression of two TALE FokI nuclease cleavage half domains, each of which is connected in a separate vector to either Tse2 or Tsi2 via a T2A self cleavage site, is delivered into a target host cell. As nuclease-mediated modification of cells depends on the coexpression of both TAL nuclease cleavage half domains and the survival of cells depends on expression of Tsi2, only cells with balanced Tse2 and Tsi2 expression levels will survive and grow.

FIG. 35 shows the amino acid sequence of a TAL polymerase fusion protein (SEQ ID NO: 232), referred to as TAL-Bst1.0. The first boxed sequence shows a V5 epitope (amino acids 1 through 15, bold italicized sequence) and a nuclear localization signal (NLS) (amino acids 16 through 29). A 136 amino acid N-terminal region of a TAL effector (Hax3) is shown as amino acids 30 through 165. The first complete and last partial TAL repeats are shown in double lined boxes (amino acids 166 through 748). A 135 amino acid C-terminal region of a TAL effector (also Hax3) is shown as amino acids 749 through 883. A linker sequence (GGGVTM) (SEQ ID NO: 89) is shown as amino acids 884 through 889. A portion of a DNA polymerase I from Bacillus stearothermophilus (Bst1.0) is shown as amino acids 890 to 1469. The DNA polymerase contains both an exonuclease activity and a 5′-3′ DNA polymerase activity (amino acid sequences not delineated).
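The domain boundaries listed for TAL-Bst1.0 can be checked arithmetically. The following is a minimal sketch, assuming only the positions stated in the FIG. 35 description; the domain labels and the helper name `domain_lengths` are illustrative and are not part of the sequence listing.

```python
# Domain boundaries of TAL-Bst1.0 as listed in the FIG. 35 description
# (1-based, inclusive amino acid positions). Labels are illustrative.
DOMAINS = [
    ("V5 epitope", 1, 15),
    ("NLS", 16, 29),
    ("Hax3 flanking region before the repeats", 30, 165),
    ("TAL repeats", 166, 748),
    ("Hax3 flanking region after the repeats", 749, 883),
    ("linker GGGVTM", 884, 889),
    ("Bst DNA polymerase I portion", 890, 1469),
]


def domain_lengths(domains):
    """Return {label: length}; raise if consecutive domains leave a gap or overlap."""
    lengths = {}
    expected_start = domains[0][1]
    for label, start, end in domains:
        if start != expected_start:
            raise ValueError(f"{label} starts at {start}, expected {expected_start}")
        lengths[label] = end - start + 1
        expected_start = end + 1
    return lengths


LENGTHS = domain_lengths(DOMAINS)
for label, length in LENGTHS.items():
    print(f"{label}: {length} aa")
```

Under these positions the two flanking regions come out at 136 and 135 amino acids, the linker at 6 (matching GGGVTM), and the domains tile the full 1469 residues without gaps.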

FIGS. 36A and 36B show an example of how a TAL effector sequence may be designed to allow for full-length sequencing of a repetitive TAL effector region. FIG. 36A shows an exemplary TAL effector sequence with 24 cassettes where sequencing is performed with a set of forward and reverse primers binding 5′ and 3′ of the TAL repeat coding region (e.g., within the TAL N- and C-terminus, respectively) and at least one additional primer specifically binding to a cassette within the center or near the center of the series of assembled TAL effector cassettes. The at least one additional primer can be designed to bind in the 5′ portion of the TAL effector sequence reading forward or can be designed to bind in the 3′ portion of the TAL effector sequence reading backward (reverse primer). For larger TAL effector sequences two additional primers reading forward and backward may be used. FIG. 36B shows how a cassette allocated to a specific position within the series of assembled cassettes (in this example position 16) may be designed to allow specific binding of an additional primer. Based on the repetitive structure all cassettes may contain at least one homologous region (indicated by a vertical bar in each cassette) with an identical nucleotide composition. Such homologous region should have a size sufficient to allow for specific primer binding. For example, the homologous region may contain at least 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30 or more than 30 nucleotides. At least one cassette of each binding category (A-, G-, C-, T-binder) allocated to a defined position (here position 16 within the 24 cassettes) is designed such that the nucleotide composition within the homologous region differs from the same region within all other cassettes in all other positions. The nucleotide composition within the homologous region may differ by at least 3 or 4 nucleotides (indicated by vertical lines in the hatched homologous region of cassette No. 16) from the nucleotide composition of the same region in all other cassettes to allow stable binding of a sequencing primer to the target sequence within the selected cassette. The skilled person will understand that the number of differing nucleotides depends on the length of the homologous region available for primer binding and the length and melting temperature of the sequencing primer. The differing nucleotides are preferably located at the 3′-end of the primer to prevent unspecific binding. To generate cassettes with unique nucleotide composition, silent mutations may be introduced without changing the encoded amino acid sequence. As described elsewhere herein such modifications may for example be introduced based on the degeneracy of the genetic code using alternative codons.
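The requirement that the marked cassette differ from all other cassettes by at least 3 nucleotides within the homologous region can be sketched as a simple mismatch count. The sequences below are hypothetical 12-nucleotide regions chosen for illustration (they are not taken from the sequence listing), and the function names are likewise assumptions.

```python
def mismatches(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("regions must have equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


def is_unique_primer_site(marked, others, min_diff=3):
    """True if the marked region differs from every other region by >= min_diff bases."""
    return all(mismatches(marked, other) >= min_diff for other in others)


# Hypothetical homologous region shared by the other 23 cassettes
# (codons CTG ACC CCG GAA, encoding L-T-P-E).
common = "CTGACCCCGGAA"
# Hypothetical position-16 variant with three silent codon changes
# (CTG ACA CCA GAG, still encoding L-T-P-E).
marked = "CTGACACCAGAG"

print(mismatches(common, marked))
print(is_unique_primer_site(marked, [common] * 23))
```

A real design would additionally weigh primer length, melting temperature and the position of the mismatches relative to the primer 3′-end, none of which this sketch models.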

DETAILED DESCRIPTION OF THE INVENTION

Definitions

As used herein “TAL nucleic acid binding cassette” (also referred to as a “TAL cassette”) refers to a nucleic acid that encodes a polypeptide which allows a protein containing that polypeptide to bind a single base pair (e.g., A, T, C, or G) of a nucleic acid molecule. In most instances, proteins will contain more than one polypeptide encoded by a TAL nucleic acid binding cassette. The individual amino acid sequences of the encoded multimers are referred to as “TAL repeats”. In many instances, TAL repeats will be between twenty-eight and forty amino acids in length and (for the amino acids present) will share at least 60% (e.g., at least about 65%, at least about 70%, at least about 75%, at least about 80%, from about 60% to about 95%, from about 65% to about 95%, from about 70% to about 95%, from about 75% to about 95%, from about 80% to about 95%, from about 85% to about 95%, from about 60% to about 90%, from about 60% to about 85%, from about 65% to about 90%, from about 70% to about 90%, from about 75% to about 90%, etc.) identity with the following thirty-four amino acid sequence:

LTPDQVVAIA SXXGGKQALE TVQRLLPVLC QAHG (SEQ ID NO: 7)

As explained in additional detail elsewhere herein, the two Xs at positions twelve and thirteen in the above sequence represent amino acids which allow TAL nucleic acid binding cassettes to recognize a specific base in a nucleic acid molecule.

In many instances, the final TAL repeat present at the carboxyl terminus of a series of repeats will be a partial TAL repeat in that its carboxyl terminal end may be missing (e.g., only roughly the amino terminal 15 to 20 amino acids of this final TAL repeat may be present).

Nucleotide and amino acid sequences may be compared to each other by a number of means. For example, a number of publicly available computer programs may be used to compare sequences.

In sequence comparisons, typically one sequence acts as a reference sequence, to which test sequences are compared. When using a sequence comparison algorithm, test and reference sequences are entered into a computer, subsequence coordinates are designated, if necessary, and sequence algorithm program parameters are designated. Default program parameters can be used, as described below for the BLASTN (nucleic acids) and BLASTP (proteins) programs, or alternative parameters can be designated. The sequence comparison algorithm then calculates the percent sequence identities for the test sequences relative to the reference sequence, based on the program parameters. Alignment of sequences for comparison can also be conducted, e.g., by the local homology algorithm of Smith & Waterman, Adv. Appl. Math. 2:482 (1981), by the homology alignment algorithm of Needleman & Wunsch, J. Mol. Biol. 48:443 (1970), by the search for similarity method of Pearson & Lipman, Proc. Natl. Acad. Sci. USA 85:2444 (1988), or by computerized implementations of these algorithms (GAP, BESTFIT, FASTA, and TFASTA in the Wisconsin Genetics Software Package, Genetics Computer Group, 575 Science Drive, Madison, Wis., USA).

Algorithms suitable for determining percent sequence identity and sequence similarity are the BLAST and BLAST 2.0 algorithms, which are described in Altschul et al., J. Mol. Biol. 215:403-410 (1990) and Altschul et al., Nucl. Acids Res. 25:3389-3402 (1997), respectively. Software for performing BLAST analyses is publicly available through the National Center for Biotechnology Information (see http://blast.ncbi.nlm.nih.gov/Blast.cgi). In some instances, amino acid sequence comparisons are performed using the algorithm designated blastp (protein-protein BLAST) with the default settings.
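To make the notion of percent identity against the thirty-four amino acid sequence of SEQ ID NO: 7 concrete, the following is a minimal sketch of an ungapped, position-by-position comparison. It is not BLAST and does not perform alignment; it assumes the two sequences are already aligned and of equal length, and it skips the two X positions (12 and 13) and any gap characters. The function name and the example repeat are illustrative assumptions.

```python
# SEQ ID NO: 7 as given above, with the spacing removed.
CONSENSUS = "LTPDQVVAIASXXGGKQALETVQRLLPVLCQAHG"


def percent_identity(reference, test):
    """Percent identity over pre-aligned, equal-length sequences.

    Positions where the reference is 'X' (any residue) or where either
    sequence has a gap ('-') are excluded from the comparison.
    """
    if len(reference) != len(test):
        raise ValueError("sequences must be pre-aligned to equal length")
    compared = 0
    matches = 0
    for r, t in zip(reference, test):
        if r == "X" or r == "-" or t == "-":
            continue
        compared += 1
        if r == t:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


# Illustrative repeat: consensus with RVD 'HD' and one substitution (D4 -> E).
example_repeat = "LTPEQVVAIASHDGGKQALETVQRLLPVLCQAHG"
print(percent_identity(CONSENSUS, example_repeat))
```

For repeats of differing length, a proper alignment tool such as blastp would be used first; this sketch only illustrates the identity calculation itself.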

As used herein “TAL effector” refers to a protein that is composed of more than one TAL repeat and is capable of binding to nucleic acid in a sequence specific manner. In many instances, TAL effectors will contain at least six (e.g., at least 8, at least 10, at least 12, at least 15, at least 17, from about 6 to about 25, from about 6 to about 35, from about 8 to about 25, from about 10 to about 25, from about 12 to about 25, from about 8 to about 22, from about 10 to about 22, from about 12 to about 22, from about 6 to about 20, from about 8 to about 20, from about 10 to about 22, from about 12 to about 20, from about 6 to about 18, from about 10 to about 18, from about 12 to about 18, etc.) TAL repeats. In some instances, a TAL effector may contain 18 or 24 or 17.5 or 23.5 TAL nucleic acid binding cassettes. In additional instances, a TAL effector may contain 15.5, 16.5, 18.5, 19.5, 20.5, 21.5, 22.5 or 24.5 TAL nucleic acid binding cassettes. TAL effectors will generally have at least one polypeptide region which flanks the region containing the TAL repeats. In many instances, flanking regions will be present at both the amino and carboxyl termini of the TAL repeats.

As used herein “TAL effector fusion” refers to a TAL effector connected to another polypeptide or protein with which it is not naturally associated. In many instances, the non-TAL component of the TAL effector fusion will confer a functional activity (e.g., an enzymatic activity) upon the fusion protein. The one or more connected polypeptides or proteins may have functions equal to or different from the TAL effector. For example, a TAL effector fusion may also have binding activity or may have an activity that directly or indirectly triggers nucleic acid modification, such as, e.g., an enzymatic activity.

In one aspect, the function of a TAL effector may be embodied by the binding activity per se. Specific binding of a TAL effector to a target sequence may, e.g., block the sequence or repress downstream events or may allow for detection of a sequence or recruit other molecules. In many instances TAL effectors also include proteins wherein the TAL repeat is operatively linked with at least one other activity. TAL effectors engineered to bind specific DNA targets can be designed according to rational criteria applying known TAL code rules or computerized algorithms for processing information in a database storing information of existing RVD designs and binding data. Functional TAL effectors can further be selected from rationally designed libraries in directed evolution approaches described elsewhere herein.
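As an illustration of rational design using TAL code rules, the sketch below maps each base of a target sequence to an RVD and places that RVD at positions 12 and 13 of the thirty-four amino acid sequence of SEQ ID NO: 7. The RVD-to-base assignments used here (NI for A, HD for C, NN for G, NG for T) are the commonly cited ones and are an assumption of this sketch rather than a statement of this section; NN in particular is also reported to recognize A. The function names are illustrative.

```python
# Commonly cited RVD assignments (assumption of this sketch).
RVD_FOR_BASE = {"A": "NI", "C": "HD", "G": "NN", "T": "NG"}

# SEQ ID NO: 7 with the RVD positions (12 and 13) marked as XX.
REPEAT_TEMPLATE = "LTPDQVVAIASXXGGKQALETVQRLLPVLCQAHG"


def design_rvds(target):
    """Return one RVD per base of the target DNA sequence, in order."""
    rvds = []
    for base in target.upper():
        if base not in RVD_FOR_BASE:
            raise ValueError(f"unsupported base: {base!r}")
        rvds.append(RVD_FOR_BASE[base])
    return rvds


def repeat_sequences(target):
    """Return one 34 amino acid repeat per target base, with the RVD filled in."""
    return [REPEAT_TEMPLATE.replace("XX", rvd) for rvd in design_rvds(target)]


print(design_rvds("TACG"))
for repeat in repeat_sequences("TACG"):
    print(repeat)
```

A practical design tool would also account for flanking sequence requirements, the terminal half-repeat and stored binding data, which this sketch leaves out.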

TAL effectors may be fused to DNA modifying enzymes capable of modifying the genetic material of a cell by, for example, cleavage, covalent interaction, water-mediated interaction or the like. The TAL fusion partner may be any DNA interacting or modifying protein such as, for example, an activator or a repressor, a nuclease, a topoisomerase, a gyrase, a ligase, a glycosylase, an acetylase, a deacetylase, an integrase, a transposase, a methylase, a demethylase, a methyl-transferase, a homing endonuclease, a kinase, a recombinase, a phosphatase, a sulfurylase or an inhibitor of the one or more activities of one or more of such TAL fusion partners.

As used herein a TAL binding site or target binding site refers to any order of bases in a given nucleic acid sequence that can be recognized and bound by a TAL effector. Such binding site can be provided either in the context of double-stranded DNA or alternatively in the context of a DNA-RNA hybrid, wherein the DNA strand determines binding specificity. If the binding site is provided in the context of double-stranded DNA it can be methylated or unmethylated.

As used herein the term “nucleic acid molecule” refers to a covalently linked sequence of nucleotides or bases (e.g., ribonucleotides for RNA and deoxyribonucleotides for DNA, but also including DNA/RNA hybrids where the DNA is in separate strands or in the same strands) in which the 3′ position of the pentose of one nucleotide is joined by a phosphodiester linkage to the 5′ position of the pentose of the next nucleotide. Nucleic acid molecules may be single- or double-stranded or partially double-stranded. Nucleic acid molecules may appear in linear or circularized form in a supercoiled or relaxed formation with blunt or sticky ends and may contain “nicks”. Nucleic acid molecules may be composed of completely complementary single strands or of partially complementary single strands forming at least one mismatch of bases. Nucleic acid molecules may further comprise two self-complementary sequences that may form a double-stranded stem region, optionally separated at one end by a loop sequence. The two regions of a nucleic acid molecule which comprise the double-stranded stem region are substantially complementary to each other, resulting in self-hybridization. However, the stem can include one or more mismatches, insertions or deletions. Nucleic acid molecules may comprise chemically, enzymatically, or metabolically modified forms of nucleic acid molecules or combinations thereof. Chemically synthesized nucleic acid molecules may refer to nucleic acids typically less than or equal to 150 nucleotides long (e.g., between 5 and 150, between 10 and 100, between 15 and 50 nucleotides in length) whereas enzymatically synthesized nucleic acid molecules may encompass smaller as well as larger nucleic acid molecules as described elsewhere in the application. Enzymatic synthesis of nucleic acid molecules may include stepwise processes using enzymes such as polymerases, ligases, exonucleases, endonucleases or the like or a combination thereof. Thus, the invention provides, in part, compositions and combined methods relating to the enzymatic assembly of chemically synthesized nucleic acid molecules.

Nucleic acid molecule also refers to short nucleic acid molecules, often referred to as, for example, primers or probes. Primers are often referred to as single stranded starter nucleic acid molecules for enzymatic assembly reactions whereas probes may be typically used to detect at least partially complementary nucleic acid molecules. A nucleic acid molecule has a “5′-terminus” and a “3′-terminus” because nucleic acid molecule phosphodiester linkages occur between the 5′ carbon and 3′ carbon of the pentose ring of the substituent mononucleotides. The end of a nucleic acid molecule at which a new linkage would be to a 5′ carbon is its 5′ terminal nucleotide. The end of a nucleic acid molecule at which a new linkage would be to a 3′ carbon is its 3′ terminal nucleotide. A terminal nucleotide or base, as used herein, is the nucleotide at the end position of the 3′- or 5′-terminus. A nucleic acid molecule sequence, even if internal to a larger nucleic acid molecule (e.g., a sequence region within a nucleic acid molecule), also can be said to have 5′- and 3′-ends.

A “wild-type sequence” as used herein refers to any given sequence (e.g., an isolated sequence) that can be used as template for subsequent reactions or modifications. As understood by the skilled artisan, a wild-type sequence may include a nucleic acid sequence (such as DNA or RNA or combinations thereof) or an amino acid sequence or may be composed of different chemical entities. In some embodiments, the wild-type sequence may refer to an in silico sequence which may be the sequence information as such or sequence data that can be stored in a computer readable medium in a format that is readable and/or editable by a mechanical device. A wild-type sequence (reflecting a given order of nucleotide or amino acid symbols) can be entered, e.g., into a customer portal via a web interface. In most instances, the sequence initially provided by a customer would be regarded as wild-type sequence in view of downstream processes based thereon, irrespective of whether the sequence itself is a natural or modified sequence, i.e., it was modified with regard to another wild-type sequence or is completely artificial.

In some instances wild-type sequence may also refer to a physical molecule such as a nucleic acid molecule (such as RNA or DNA or combinations thereof) or a protein, polypeptide or peptide composed of amino acids. Methods to obtain a wild-type sequence by chemical, enzymatic or other means are known in the art. In one embodiment, a physical nucleic acid wild-type sequence may be obtained by PCR amplification of a corresponding template region or may be synthesized de novo based on assembly of synthetic oligonucleotides. A wild-type sequence as used herein can encompass naturally occurring as well as artificial (e.g., chemically or enzymatically modified) parts or building blocks. A wild-type sequence can be composed of two or more sequence parts. A wild-type sequence can be, e.g., a coding region, an open reading frame, an expression cassette, an effector domain, a repeat domain, a promoter/enhancer or terminator region, an untranslated region (UTR) but may also be a defined sequence motif, e.g., a binding, recognition or cleavage site within a given sequence. A wild-type sequence can be either DNA or RNA of any length and can be linear, circular or branched and can be either single-stranded or double-stranded.

“Optimization” of a sequence as used herein shall include all aspects of sequence modification of a given wild-type sequence to improve or prepare the sequence for a specific purpose or application. Optimization can be performed in silico, e.g., by computer-implemented methods using specific algorithms or software. A given wild-type sequence may be completely optimized (e.g., over its entire length). Alternatively, only parts or domains of the sequence may be subject to an optimization process. In some instances optimization includes modification of a physical molecule, e.g., by replacing, inserting or deleting one or more elements in the sequence. By way of example, a protein sequence or function can be optimized by modification of the underlying nucleic acid sequence. This can be achieved by molecular methods known in the art such as mutation, shuffling or recombination approaches or by de novo synthesis of modified sequence parts.

Optimization of a wild-type sequence may include silent codon changes to replace non- or less preferred codons by more preferred codons without modifying the encoded amino acid sequence. Codon-optimization may for example impact expression yields, solubility, protein activity, protein folding or other functions of an expression product. Optimization of the codon bias of a wild-type sequence is often employed to allow for optimal expression of a given gene in a homologous or heterologous host. For example, a gene originally derived from plant, virus, bacteria, yeast etc. may be adapted to the preferred codon usage of mammalian cells to achieve optimal expression yields in a mammalian host and vice versa. Apart from codon usage, certain sequence motifs such as splice sites, cis-active inhibitory RNA motifs (often referred to as CRS or INS), internal poly-adenylation signal sequences such as, e.g., AUUUA, or silencing motifs may have to be eliminated to allow for heterologous expression. Furthermore, specific motifs triggering expression can be fused to (e.g., 5′ or 3′-UTR regions) or inserted (such as, e.g., modification of the intragenic CpG dinucleotide content) in a sequence to modulate expression or activity of an expression product in a specific host.
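A silent codon change can be sketched as a codon-by-codon substitution followed by a check that the translation is unchanged. The example below is a minimal sketch using the standard genetic code; the `PREFERRED` table is a hypothetical, partial set of preferred codons chosen for illustration and does not represent the codon usage of any particular host. Motif removal (splice sites, INS elements and the like) is not modeled.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Hypothetical preferred codons (illustrative only, not host-specific data).
PREFERRED = {"L": "CTG", "A": "GCC", "G": "GGC", "P": "CCC", "T": "ACC", "E": "GAG"}


def translate(dna):
    """Translate an in-frame DNA sequence into one-letter amino acids."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def optimize(dna, preferred=PREFERRED):
    """Replace each codon by the preferred synonymous codon, if one is listed."""
    dna = dna.upper()
    protein = translate(dna)
    optimized = "".join(
        preferred.get(aa, dna[3 * i:3 * i + 3]) for i, aa in enumerate(protein)
    )
    # The change must be silent: the encoded amino acid sequence is preserved.
    if translate(optimized) != protein:
        raise ValueError("preferred codon table changes the encoded protein")
    return optimized


print(optimize("TTAACACCTGAA"), translate("TTAACACCTGAA"))
```

Codons for amino acids missing from the table are left untouched, so the sketch degrades gracefully when only a partial preference table is available.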

In genetic engineering, selectable markers are widely used as reporter systems to evaluate the success of cloning strategies or cell transduction efficiency. Various selection marker genes are known in the art, often encoding antibiotic resistance function for selection in prokaryotic (e.g. against ampicillin, kanamycin, tetracycline, chloramphenicol, zeocin, spectinomycin/streptomycin) or eukaryotic cells (e.g. geneticin, neomycin, hygromycin, puromycin, blasticidin, zeocin) under selective pressure. Other marker systems allow for screening and identification of wanted or unwanted cells such as the well-known blue/white screening system used in bacteria to select positive clones in the presence of X-gal or fluorescent reporters such as green or red fluorescent proteins expressed in successfully transduced host cells. Another class of selection markers, most of which are only functional in prokaryotic systems, relates to counter selectable marker genes, often also referred to as “death genes”, which express toxic gene products that kill producer cells. Examples of such genes include sacB, rpsL(strA), tetAR, pheS, thyA, gata-1, or ccdB, the function of which is described in Reyrat et al. Counterselectable Markers: Untapped Tools for Bacterial Genetics and Pathogenesis. Infect Immun. 66(9): 4011-4017 (1998).

A “counter selectable” marker (also referred to herein as a “negative selectable marker”) or marker gene as used herein refers to any gene or functional variant thereof that allows for selection of wanted vectors, clones, cells or organisms by eliminating unwanted elements. These markers are often toxic or otherwise inhibitory to replication under certain conditions which often involve exposure to a specific substrate or a shift in growth conditions. Counter selectable marker genes are often incorporated into genetic modification schemes in order to select for rare recombination or cloning events that require the removal of the marker or to selectively eliminate plasmids or cells from a given population. They have been used for the selection of transformed bacteria or to identify mutants in genetic engineering and are likewise appropriate for use in certain aspects of the invention. Such selectable marker genes help to significantly boost cloning efficiency by reducing the background in cloning experiments represented by uncut or recircularized empty background vectors lacking an insert. Negative selection requires a loss of the marker function which may be achieved by different strategies. In a first embodiment the toxic function may, e.g., be destroyed by insertion of a DNA fragment or gene of interest into either the open reading frame (ORF) of the marker gene or into/prior to the regulatory region (e.g. promoter region) thereby interfering with marker gene expression (“insertion strategy”). Alternatively, a DNA fragment or gene of interest may be inserted thereby completely replacing the marker gene (“replacement strategy”). Whereas most of the embodiments described elsewhere herein refer to the replacement strategy it is understood by the skilled person that vectors used in methods of the invention can be adapted to use the insertion strategy instead. In both cases cloning vectors which carry the DNA fragment or gene of interest within or instead of the selectable marker ORF will allow bacterial growth and selection of positive clones (i.e. carrying the desired insert) whereas cells obtaining the marker gene expression construct will die and automatically be sorted out.

One example of a negative selectable marker system widely used in bacterial cloning methods is the CcdA/CcdB Type II Toxin-antitoxin system. The system encodes two proteins: the 101 amino acid (11.7 kDa) CcdB toxin which inhibits cell proliferation by forming a complex with the GyrA subunit of DNA gyrase, a bacterial topoisomerase II, and the 72 amino acid CcdA antidote (8.7 kDa) which prevents the toxic effect by forming a tight complex with CcdB. The CcdA/CcdB system is located on the F-plasmid and functions in plasmid maintenance in E. coli by killing those daughter cells that have not inherited a copy of the F-plasmid at cell division, which is also referred to as “post-segregational killing” (Bernard and Couturier. Mol. Gen. Genet. 226, 297-304 (1991); Salmon et al. Mol. Gen. Genet. 244, 530-538 (1994)). In order to use this system for cloning purposes the CcdB encoding gene can be inserted into cloning or expression vectors to kill bacteria which have not received a recombinant vector carrying a gene or DNA molecule of interest. One example where the ccd selection system has been successfully employed is the Gateway® Technology offered by Invitrogen/Life Technologies (Carlsbad, Calif.) which relies on replacement of the ccdB gene by a DNA fragment or gene of interest via site-specific homologous recombination and is described in more detail elsewhere herein.

In certain instances it may be required to amplify or propagate a vector carrying a negative selectable marker gene. In toxin-sensitive bacteria this may be achieved by using an inducible marker gene expression cassette. Another possibility is the provision of a host strain which is resistant to the toxic effects of the marker protein. For example, to allow for propagation of vectors carrying a ccdB gene, host strains have been genetically engineered to carry a CcdA expression cassette which guarantees survival of bacteria receiving a ccdB-containing vector. Such a ccdB Survival™ strain is offered by Invitrogen/Life Technologies (Carlsbad, Calif.). Furthermore, CcdA expression host strains are described in U.S. Pat. No. 7,176,029 which is incorporated by reference in its entirety herein.
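The selection logic described for the ccdB marker can be summarized as a small truth table. This is a deliberately simplified sketch: it assumes constitutive ccdB expression whenever the gene is present, treats CcdA expression by the host as fully protective, and ignores inducible cassettes and expression levels. The function and variable names are illustrative.

```python
def cell_survives(vector_carries_ccdB, host_expresses_ccdA):
    """A cell survives if no CcdB toxin is made or the CcdA antidote is present."""
    return (not vector_carries_ccdB) or host_expresses_ccdA


# Cloning in a toxin-sensitive host (replacement strategy): clones in which
# the insert has replaced ccdB grow, empty vectors are eliminated.
cloning_outcomes = {
    "insert replaced ccdB": cell_survives(False, False),
    "empty vector (ccdB intact)": cell_survives(True, False),
}

# Propagation of the ccdB-containing vector in a CcdA-expressing host strain.
propagation_ok = cell_survives(True, True)

print(cloning_outcomes, propagation_ok)
```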

Another example of a selection system that relies on toxin-antitoxin interaction is the Tse2/Tsi2 system. The two components are derived from the type-6-secretion system (T6SS) which was shown to be used by Pseudomonas aeruginosa to inject type VI secretion exported 1-3 effector proteins (Tse1, Tse2 and Tse3) into the periplasmic space of neighboring competing Gram-negative bacteria, thereby inhibiting target cell proliferation (Hood et al. Cell Host Microbe. 7(1):25-37 (2010)). However, to avoid self-intoxication by Tse2, part of which also remains in the cytosol of P. aeruginosa, the cytosolic type VI secretion immunity 2 protein (Tsi2), which neutralizes the toxic effects of Tse2, must be present in the cell. Tse2 has been shown to inhibit essential cellular processes in a broad spectrum of organisms including prokaryotic (e.g. E. coli, Burkholderia thailandensis) or eukaryotic cells (e.g. S. cerevisiae, HeLa cells) which makes it an attractive universal selection marker. A Tse2 encoding expression cassette (containing a tse2 gene operably linked to a regulatory sequence) can therefore be inserted into cloning vectors to allow counter selection of positive clones containing inserted DNA fragments or a gene of interest whereas those cells which have received a Tse2 expressing plasmid will be sorted out. As described above, the Tse2 expression cassette can be adapted to allow either insertion or replacement of one or more DNA fragments or a gene of interest. Various vectors allowing for inducible or constitutive expression of the tse2 gene (or truncated or mutated versions thereof) as counter selectable marker for recombinational, TOPO, TA- or restriction enzyme cleavage-mediated cloning are described in U.S. Patent Publication No. 2012/0270271 which is incorporated by reference in its entirety herein.

In certain instances it may be required to amplify or propagate a vector carrying a tse2 gene in a host cell. In Tse2 sensitive cells, this can be achieved by either making Tse2 expression inducible or by providing an antidote to confer immunity upon Tse2 expressing cells. The antidote can be any expression product capable of interfering with the cytotoxic activity of Tse2, including but not limited to Tse2 antisense constructs, Tse2 binding aptamers and Tse2 binding polypeptides. In one embodiment an inducible Tsi2 expression cassette can be included in the vector containing a Tse2 expression cassette. Another possibility is the co-expression with a Tsi2 coding vector or the provision of a host strain expressing the Tsi2 antidote to render a cell immune towards Tse2 expression. In certain embodiments it may be required to use a host cell which has been genetically engineered to carry a Tsi2 expression cassette chromosomally integrated or on an extrachromosomal element. Different embodiments providing suitable Tse2 antidotes or recombinant Tse2 expressing immune host cells are described in U.S. Patent Publication No. 2011/0311499 which is incorporated by reference in its entirety herein.

Any of the vectors used in embodiments of the invention (including cloning vectors, expression vectors, capture vectors, viral vectors or functional vectors) can be modified to carry counter selectable marker genes such as ccdB or tse2 or functional variants thereof. In certain instances it may be preferred to use a sequence-optimized version of a selectable marker gene such as, e.g., a ccdB gene or a tse2 gene adapted to the preferred codon usage of E. coli. To achieve improved expression of a selectable marker gene in a specific host cell, procedures of sequence and/or codon optimization as described above may be pursued.

A “vector” as used herein is a nucleic acid molecule that can be used as a vehicle to transfer genetic material into a cell. A vector can be a plasmid, a virus or bacteriophage, a cosmid or an artificial chromosome such as, e.g., yeast artificial chromosomes (YACs) or bacterial artificial chromosomes (BACs). In most instances a vector refers to a DNA molecule harboring at least one origin of replication, a multiple cloning site (MCS) and one or more selection markers. A vector is typically composed of a backbone region and at least one insert or transgene region or a region designed for insertion of a DNA fragment or transgene such as a MCS. The backbone region often contains an origin of replication for propagation in at least one host and one or more selection markers. In most instances a vector contains additional features. Such additional features may include natural or synthetic promoters, genetic markers, antibiotic resistance cassettes or selection markers (e.g., toxins such as ccdB or tse2), epitopes or tags for detection, manipulation or purification (e.g., V5 epitope, c-myc, hemagglutinin (HA), FLAG™, polyhistidine (His), glutathione-S-transferase (GST), maltose binding protein (MBP)), scaffold attachment regions (SARs) or reporter genes (e.g., green fluorescent protein (GFP), red fluorescent protein (RFP), luciferase, β-galactosidase etc.). In most instances vectors are used to isolate, multiply or express inserted DNA fragments in a target host. A vector can for example be a cloning vector, an expression vector, a functional vector, a capture vector, a co-expression vector (for expression of more than one open reading frame), a viral vector or an episome (i.e., a nucleic acid capable of extrachromosomal replication) etc.

A “cloning vector” as used herein includes any vector that can be used to delete, insert, replace or assemble one or more nucleic acid molecules. In some instances a cloning vector may contain a counter selectable marker gene (such as, e.g., ccdB or tse2) that can be removed or replaced by another transgene or DNA fragment. In some instances a cloning vector may be referred to as donor vector, entry vector, shuttle vector, destination vector, target vector, functional vector or capture vector. Cloning vectors typically contain a series of unique restriction enzyme cleavage sites (e.g., type II or type IIS) for removal, insertion or replacement of DNA fragments. Alternatively, DNA fragments can be replaced or inserted by TOPO® Cloning or recombination as, e.g., employed in the GATEWAY® Cloning System offered by Invitrogen/Life Technologies (Carlsbad, Calif.) and described in more detail elsewhere herein. A cloning vector that can be used for expression of a transgene in a target host may also be referred to as expression vector. In some instances a cloning vector is engineered to obtain a TAL nucleic acid binding cassette, a TAL repeat, a TAL effector or a TAL effector fusion.

An “expression vector” is designed for expression of a transgene and generally harbors at least one promoter sequence that drives expression of the transgene. Expression as used herein refers to transcription of a transgene or transcription and translation of an open reading frame and can occur in a cell-free environment such as a cell-free expression system or in a host cell. In most instances expression of an open reading frame or a gene results in the production of a polypeptide or protein. An expression vector is typically designed to contain one or more regulatory sequences such as enhancer, promoter and terminator regions that control expression of the inserted transgene. Suitable expression vectors include, without limitation, plasmids and viral vectors. Vectors and expression systems for various applications are available from commercial suppliers such as Novagen (Madison, Wis.), Clontech (Palo Alto, Calif.), Stratagene (La Jolla, Calif.), and Life Technologies Corp. (Carlsbad, Calif.). In some instances an expression vector is engineered for expression of a TAL nucleic acid binding cassette, a TAL repeat, a TAL effector or a TAL effector fusion.

A “capture vector” as used herein is a vector suitable for assembly of TAL cassettes. A capture vector contains a region for TAL cassette insertion that is typically flanked by restriction cleavage sites such as type IIS cleavage sites. The capture vector may contain a counter selectable marker gene such as, e.g., ccdB or tse2. Different capture vectors can be used for assembly of different TAL cassettes. In some instances, all required TAL cassettes may be assembled into a single capture vector. In other instances, at least two capture vectors may be used to assemble all required TAL cassettes. For example, for the assembly of n TAL cassettes, cassettes 1 to n/2 may be assembled into a first capture vector and cassettes (n/2+1) to n may be assembled into a second capture vector, and both capture vectors may be combined in a subsequent reaction to assemble the TAL cassettes of the first capture vector and the TAL cassettes of the second capture vector into a third vector or third capture vector. In another example, three capture vectors may be used wherein each of the three capture vectors carries one third of the total number of TAL cassettes to be assembled. In yet another example the number of TAL cassettes assembled into each capture vector may be different. For example, capture vectors 1, 2, 3 and 4 may comprise 12 cassettes, 6 cassettes, 4 cassettes and 2 cassettes respectively, which may further be combined stepwise or in parallel reactions into 24 cassettes.
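The distribution of cassettes over capture vectors described above can be illustrated with a short sketch. The function names split_cassettes and blocks_from_sizes are hypothetical and chosen only for illustration; the sketch assumes that cassettes are numbered from 1 and assigned to capture vectors as contiguous blocks, and that any remainder is placed in the earlier vectors when n is not evenly divisible. It models only the bookkeeping, not the cloning reactions.

```python
def blocks_from_sizes(sizes):
    """Assign consecutive cassette positions (starting at 1) to capture
    vectors, given the number of cassettes each vector should hold."""
    blocks, start = [], 1
    for size in sizes:
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks


def split_cassettes(n_cassettes, n_vectors=2):
    """Split cassettes 1..n as evenly as possible over n_vectors capture
    vectors (assumption: leftover cassettes go to the earlier vectors)."""
    if n_cassettes < 1 or n_vectors < 1:
        raise ValueError("need at least one cassette and one capture vector")
    n_vectors = min(n_vectors, n_cassettes)
    base, extra = divmod(n_cassettes, n_vectors)
    sizes = [base + (1 if i < extra else 0) for i in range(n_vectors)]
    return blocks_from_sizes(sizes)


if __name__ == "__main__":
    # Two capture vectors: cassettes 1 to n/2 and (n/2+1) to n.
    print(split_cassettes(24, 2))
    # Unequal distribution over four capture vectors (12, 6, 4 and 2).
    print(blocks_from_sizes([12, 6, 4, 2]))
```

With n = 24 and two capture vectors, the first block holds cassettes 1 to 12 and the second holds cassettes 13 to 24, matching the two-vector example in the text.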

A “functional vector” as used herein refers to a vector that contains either a TAL effector sequence or a TAL effector fusion sequence (with or without TAL nucleic acid binding cassettes and/or TAL repeats, respectively). For example, a functional vector can carry the flanking N- and C-termini of a TAL effector, wherein the sequence between the termini contains a counter selectable marker (such as, e.g., ccdB or tse2) that can be removed or replaced by TAL cassettes via type IIS cleavage. In many instances a functional vector contains an effector fusion domain, such as, e.g., a DNA binding or enzymatic activity. A functional vector may, e.g., carry a TAL effector fusion encoding a nuclease, an activator, a repressor or may contain a multiple cloning site. In certain aspects a functional vector may be an expression vector. In some instances a functional vector may be a topoisomerase-adapted vector or a GATEWAY® Entry Clone.

A “viral vector” generally relates to a genetically-engineered noninfectious virus containing modified viral nucleic acid sequences. In most instances a viral vector contains at least one viral promoter and is designed for insertion of one or more transgenes or DNA fragments. In some instances a viral vector is delivered to a target host together with a helper virus providing packaging or other functions. In many instances viral vectors are used to stably integrate transgenes into the genome of a host cell. A viral vector may be used for delivery and/or expression of transgenes.

Viral vectors may be derived from bacteriophage, baculoviruses, tobacco mosaic virus, vaccinia virus, retrovirus (avian leukosis-sarcoma, mammalian C-type, B-type viruses, D type viruses, HTLV-BLV group, lentivirus, spumavirus), adenovirus, parvovirus (e.g., adeno-associated viruses), coronavirus, negative strand RNA viruses such as orthomyxovirus (e.g., influenza virus) or Sendai virus, rhabdovirus (e.g., rabies and vesicular stomatitis virus), paramyxovirus (e.g., measles and Sendai), positive strand RNA viruses such as picornavirus and alphavirus (such as Semliki Forest virus), and double-stranded DNA viruses including adenovirus, herpes virus (e.g., Herpes Simplex virus types 1 and 2, Epstein-Barr virus, cytomegalovirus), and poxvirus (e.g., vaccinia, fowlpox and canarypox). Other viruses include without limitation Norwalk virus, togavirus, flavivirus, reoviruses, papovavirus, hepadnavirus, and hepatitis virus. For example, lentiviral vectors are commonly used for gene delivery because of their relatively large packaging capacity, reduced immunogenicity and their ability to stably transduce with high efficiency a large range of different cell types. Such lentiviral vectors can be “integrative” (i.e., able to integrate into the genome of a target cell) or “non-integrative” (i.e., not integrated into a target cell genome). Regulatory elements from eukaryotic viruses are often used in eukaryotic expression vectors, e.g., SV40 vectors, papilloma virus vectors, and vectors derived from Epstein-Barr virus. Other exemplary eukaryotic vectors include pMSG, pAV009/A+, pMTO10/A+, pMAMneo-5, baculovirus pDSVE, and any other vector allowing expression of proteins under the direction of the SV40 early promoter, SV40 late promoter, metallothionein promoter, murine mammary tumor virus promoter, Rous sarcoma virus promoter, polyhedrin promoter, or other promoters shown effective for expression in eukaryotic cells.

“Regulatory sequence” as used herein refers to nucleic acid sequences that influence transcription and/or translation initiation and rate, stability and/or mobility of a transcript or polypeptide product. Regulatory sequences include, without limitation, promoter sequences or control elements, enhancer sequences, response elements, protein recognition sites, inducible elements, protein binding sequences, transcriptional start sites, termination sequences, polyadenylation sequences, introns, 5′ and 3′ untranslated regions (UTRs) and other regulatory sequences that can reside within coding sequences, such as splice sites, inhibitory sequence elements (often referred to as CNS or INS, such as those known from some viruses), secretory signals, Nuclear Localization Signal (NLS) sequences, inteins, translational coupler sequences, protease cleavage sites as described in more detail elsewhere herein. A 5′ untranslated region (UTR) is transcribed, but not translated, and is located between the start site of the transcript and the translation initiation codon and may include the +1 nucleotide. A 3′ UTR can be positioned between the translation termination codon and the end of the transcript. UTRs can have particular functions such as increasing mRNA message stability or translation attenuation. Examples of 3′ UTRs include, but are not limited to polyadenylation signals and transcription termination sequences. Regulatory sequences may be universal or host- or tissue-specific.

A “promoter” as used herein is a transcription regulatory sequence which is capable of directing transcription of a nucleic acid segment (e.g., a transgene comprising, for example, an open reading frame) when operably connected thereto. A promoter is a nucleotide sequence which is positioned upstream of the transcription start site (generally near the initiation site for RNA polymerase II). A promoter typically comprises at least a core, or basal motif, and may include or cooperate with at least one or more control elements such as upstream elements (e.g., upstream activation regions (UARs)) or other regulatory sequences or synthetic elements. A basal motif constitutes the minimal sequence necessary for assembly of a transcription complex required for transcription initiation. In many instances, such minimal sequence includes a “TATA box” element that may be located between about 15 and about 35 nucleotides upstream from the site of transcription initiation. Basal promoters also may include a “CCAAT box” element (typically the sequence CCAAT) and/or a GGGCG sequence, which can be located between about 40 and about 200 nucleotides, typically about 60 to about 120 nucleotides, upstream from the transcription start site.
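The positional windows given above for basal promoter elements can be checked with a simple scan. The following is a minimal sketch only: the function name find_upstream_motif is hypothetical, the TATA box is reduced to the literal string TATA (real elements are degenerate), and distance is measured from the first base of the motif to the transcription start site.

```python
def find_upstream_motif(seq, tss_index, motif, min_up, max_up):
    """Return the upstream distances (in nucleotides) at which `motif`
    starts, restricted to the window min_up..max_up before the
    transcription start site located at 0-based index `tss_index`."""
    seq = seq.upper()
    hits = []
    for dist in range(min_up, max_up + 1):
        start = tss_index - dist
        if start >= 0 and seq[start:start + len(motif)] == motif:
            hits.append(dist)
    return hits


if __name__ == "__main__":
    # Toy sequence: a TATAAA element placed 30 nt upstream of the start site.
    toy = "G" * 20 + "TATAAA" + "C" * 24 + "ATGGCC"
    tss = 50
    # TATA box window: about 15 to about 35 nt upstream.
    print(find_upstream_motif(toy, tss, "TATA", 15, 35))
    # CCAAT box window: about 40 to about 200 nt upstream.
    print(find_upstream_motif(toy, tss, "CCAAT", 40, 200))
```

The same function can be called with the narrower window of about 60 to about 120 nucleotides to apply the typical range stated for the CCAAT box and GGGCG sequence.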

The choice of a promoter to be included in an expression vector depends upon several factors, including without limitation efficiency, selectability, inducibility, desired expression level, and cell or tissue specificity. For example, tissue-, organ- and cell-specific promoters that confer transcription only or predominantly in a particular tissue, organ, and cell type, respectively, can be used. In some instances, promoters that are essentially specific to seeds (“seed-preferential promoters”) can be useful. In many instances, constitutive promoters are used that can promote transcription in most or all tissues of a specific species. Other classes of promoters include, but are not limited to, inducible promoters, such as promoters that confer transcription in response to external stimuli such as chemical agents, developmental stimuli, or environmental stimuli. Inducible promoters may be induced by pathogens or stress like cold, heat, UV light, or high ionic concentrations or may be induced by chemicals. Examples of inducible promoters are the eukaryotic metallothionein promoter, which is induced by increased levels of heavy metals; the prokaryotic lacZ promoter, which is induced in response to isopropyl-β-D-thiogalacto-pyranoside (IPTG); and eukaryotic heat shock promoters, which are induced by raised temperature. Numerous additional bacterial and eukaryotic promoters suitable for use with the invention are known in the art and described, e.g., in Sambrook et al., Molecular Cloning, A Laboratory Manual (2nd ed. 1989; 3rd ed., 2001); Kriegler, Gene Transfer and Expression: A Laboratory Manual (1990); and Ausubel et al., Current Protocols in Molecular Biology. Bacterial expression systems for expressing the ZFP are available in, e.g., E. coli, Bacillus sp., and Salmonella (Palva et al. Secretion of interferon by Bacillus subtilis. Gene 22:229-235 (1983)). Kits for such expression systems are commercially available. Eukaryotic expression systems for mammalian cells, yeast, and insect cells are well known by those of skill in the art and are also commercially available.

Common promoters for prokaryotic protein expression are, e.g., the lac promoter or the trc and tac promoters (IPTG induction), the tetA promoter/operator (anhydrotetracycline induction), the PBAD promoter (L-arabinose induction), the rhaPBAD promoter (L-rhamnose induction) or phage promoters such as phage promoter pL (temperature shift sensitive), T7, T3, SP6, or T5.

Common promoters for mammalian protein expression are, e.g., Cytomegalovirus (CMV) promoter, SV40 promoter/enhancer, Vaccinia virus promoter, Viral LTRs (MMTV, RSV, HIV etc.), E1B promoter, promoters of constitutively expressed genes (actin, GAPDH), promoters of genes expressed in a tissue-specific manner (albumin, NSE), promoters of inducible genes (Metallothionein, steroid hormones).

Numerous promoters for expression of nucleic acids in plants are known and may be used in the practice of the invention. Such promoters may be constitutive, regulatable, and/or tissue-specific (e.g., seed specific, stem specific, leaf specific, root specific, fruit specific, etc.). Exemplary promoters which may be used for plant expression include the Cauliflower mosaic virus 35S promoter and promoters for the following genes: the ACT11 and CAT3 genes from Arabidopsis, the gene encoding stearoyl-acyl carrier protein desaturase from Brassica napus (GenBank No. X74782), and the genes encoding GPC1 (GenBank No. X15596) and GPC2 (GenBank No. U45855) from maize. Additional promoters include the tobamovirus subgenomic promoter, the cassava vein mosaic virus (CVMV) promoter (which exhibits high transcriptional activity in vascular elements, in leaf mesophyll cells, and in root tips), the drought-inducible promoter of maize, and the cold, drought, and high salt inducible promoter from potato. A number of additional promoters suitable for plant expression are found in U.S. Pat. No. 8,067,222, the disclosure of which is incorporated herein by reference.

Heterologous expression in the chloroplast of microalgae such as, e.g., Chlamydomonas reinhardtii can be achieved using, for example, the psbA promoter/5′ untranslated region (UTR) in a psbA-deficient genetic background (due to psbA/D1-dependent auto-attenuation) or by fusing the strong 16S rRNA promoter to the 5′ UTR of the psbA and atpA genes to the expression cassette as, for example, disclosed in Rasala et al., “Improved heterologous protein expression in the chloroplast of Chlamydomonas reinhardtii through promoter and 5′ untranslated region optimization”, Plant Biotechnology Journal, Volume 9, Issue 6, pages 674-683, (2011).

The promoter used to direct expression of a TAL effector encoding nucleic acid depends on the particular application. For example, a strong constitutive promoter is typically used for expression and purification of TAL-effector fusion proteins. In contrast, when a TAL effector nuclease fusion protein is administered in vivo for gene regulation, it may be desirable to use either a constitutive or an inducible promoter, depending on the particular use of the TAL effector nuclease fusion protein and other factors. In addition, a promoter suitable for administration of a TAL effector nuclease fusion protein can be a weak promoter, such as HSV thymidine kinase or a promoter having similar activity. The promoter typically can also include elements that are responsive to transactivation, e.g., hypoxia response elements, Gal4 response elements, lac repressor response element, and small molecule control systems such as tet-regulated systems and the RU-486 system (see, e.g., Gossen & Bujard. Tight control of gene expression in mammalian cells by tetracycline-responsive promoters. Proc. Natl. Acad. Sci. USA 89:5547 (1992); Oligino et al. Drug inducible transgene expression in brain using a herpes simplex virus vector. Gene Ther. 5:491-496 (1998); Wang et al. Positive and negative regulation of gene expression in eukaryotic cells with an inducible transcriptional regulator. Gene Ther. 4:432-441 (1997); Neering et al. Transduction of primitive human hematopoietic cells with recombinant adenovirus vectors. Blood 88:1147-1155 (1996); and Rendahl et al., Regulation of gene expression in vivo following transduction by two separate rAAV vectors. Nat. Biotechnol. 16:757-761 (1998)). The MNDU3 promoter can also be used, and is preferentially active in CD34+ hematopoietic stem cells.

By “host” is meant a cell or organism that supports the replication of a vector or expression of a protein or polypeptide encoded by a vector sequence. Host cells may be prokaryotic cells such as E. coli, or eukaryotic cells such as yeast, fungal, protozoal, higher plant, insect, or amphibian cells, or mammalian cells such as CHO, HeLa, 293, COS-1, and the like, e.g., cultured cells (in vitro), explants and primary cultures (in vitro and ex vivo), and cells in vivo.

As used herein, the phrase “recombination proteins” includes excisive or integrative proteins, enzymes, co-factors or associated proteins that are involved in recombination reactions involving one or more recombination sites (e.g., two, three, four, five, seven, ten, twelve, fifteen, twenty, thirty, fifty, etc.), which may be wild-type proteins (see Landy, Current Opinion in Biotechnology 3:699-707 (1993)), or mutants, derivatives (e.g., fusion proteins containing the recombination protein sequences or fragments thereof), fragments, and variants thereof. Examples of recombination proteins include Cre, Int, IHF, Xis, Flp, Fis, Hin, Gin, Phi-C31, Cin, Tn3 resolvase, TndX, XerC, XerD, TnpX, Hjc, SpCCE1, and ParA.

As used herein, the phrase “recombination site” refers to a recognition sequence on a nucleic acid molecule which participates in an integration/recombination reaction by recombination proteins. Recombination sites are discrete sections or segments of nucleic acid on the participating nucleic acid molecules that are recognized and bound by a site-specific recombination protein during the initial stages of integration or recombination. For example, the recombination site for Cre recombinase is loxP which is a 34 base pair sequence comprising two 13 base pair inverted repeats (serving as the recombinase binding sites) flanking an 8 base pair core sequence (see FIG. 1 of Sauer, B. Site-specific recombination: developments and applications. Curr. Opin. Biotech. 5:521-527 (1994)). Other examples of recognition sequences include the attB, attP, attL, and attR sequences described herein, and mutants, fragments, variants and derivatives thereof, which are recognized by the recombination protein lambda phage Integrase and by the auxiliary proteins integration host factor (IHF), Fis and excisionase (Xis).

As used herein, the phrase “recognition sequence” refers to a particular sequence which a protein, chemical compound, DNA, or RNA molecule (e.g., restriction endonuclease, a modification methylase, or a recombinase) recognizes and binds. In the present invention, a recognition sequence will usually refer to a recombination site. For example, the recognition sequence for Cre recombinase is loxP which is a 34 base pair sequence comprising two 13 base pair inverted repeats (serving as the recombinase binding sites) flanking an 8 base pair core sequence (see FIG. 1 of Sauer, B. Current Opinion in Biotechnology 5:521-527 (1994)). Other examples of recognition sequences are the attB, attP, attL, and attR sequences which are recognized by the recombinase enzyme lambda phage Integrase. attB is an approximately 25 base pair sequence containing two 9 base pair core-type Int binding sites and a 7 base pair overlap region. attP is an approximately 240 base pair sequence containing core-type Int binding sites and arm-type Int binding sites as well as sites for auxiliary proteins integration host factor (IHF), FIS and excisionase (Xis). (See Landy, Current Opinion in Biotechnology 3:699-707 (1993).)
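The loxP architecture described above (two 13 base pair inverted repeats flanking an 8 base pair core, 34 base pairs in total) can be verified computationally. The sketch below uses the commonly cited loxP sequence, which is not reproduced in this text and is therefore stated here as an assumption; the helper names are hypothetical.

```python
# Commonly cited loxP sequence (assumption: not taken from this document),
# written as left repeat + core + right repeat.
LOXP = "ATAACTTCGTATA" "GCATACAT" "TATACGAAGTTAT"


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def split_site(site, arm=13, core=8):
    """Split a site into left repeat, core and right repeat."""
    if len(site) != 2 * arm + core:
        raise ValueError("site length does not match arm/core layout")
    return site[:arm], site[arm:arm + core], site[arm + core:]


if __name__ == "__main__":
    left, core, right = split_site(LOXP)
    print(len(LOXP), len(left), len(core), len(right))
    # The repeats are inverted: the right arm is the reverse complement
    # of the left arm.
    print(right == reverse_complement(left))
```

The core is not palindromic, which is what gives the site an orientation.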

Throughout this document, unless the context requires otherwise, the words “comprise,” “comprises” and “comprising” or “contain”, “contains” or “containing” will be understood to imply the inclusion of a stated step or element or group of steps or elements but not the exclusion of any other step or element or group of steps or elements.

A Modular TAL Designer Portal and Customized Services

Gene assembly and shuffling methods and related compositions, kits and protocols including those described herein may be useful for anyone skilled in the art to assemble or clone available DNA fragments. However, in certain instances it may be useful to order gene synthesis and/or related services from a commercial supplier, e.g., when a DNA sequence (e.g., a template for PCR amplification) is not available, a project is complex or the skilled artisan is not sufficiently equipped to perform certain experiments or production steps. In some instances, gene synthesis services may be offered online via an order portal. In one aspect, orders may be placed via a web-based platform designed to provide customized gene synthesis services and/or specific products to customers. Gene synthesis services may include at least one or a combination of the following: design, optimization, synthesis, assembly, purification, mutagenesis, recombination, cloning, screening, expression, and/or analysis of nucleic acid molecules but may also include related services such as protein services, cell line construction and testing, manufacturing, kit development or product composition, assay design and/or development, comparative analyses, detection or screening, project design and/or advisory service. Gene synthesis services may include in vitro as well as in vivo processes or applications. In some aspects, gene synthesis services may include methods and compositions related to DNA binding molecules. In some embodiments the web-based platform may include means to offer services or products related to DNA binding effector molecules, such as, for example, TAL effectors.

Thus, in part, the invention relates to a web-based order portal for gene synthesis services which includes services related to DNA binding molecules, such as, e.g., TALs, zinc-finger nucleases or meganucleases. In one aspect, the order portal includes services related to TAL proteins which may include customized services or products as well as catalogue products. FIG. 1A and Example 1 provide one illustration of how such customer portal can be organized and how a customer can place an order with a service provider offering TAL-related services via a web-based platform.

In some embodiments of the invention, the web-based order portal may have a modular organization. In certain embodiments the portal may include at least one of the following: (i) a first module or web interface (“module 1”), (ii) a second module or design engine (“module 2”), and (iii) a third module or manufacture unit (“module 3”). Modules 1, 2 and 3 as used herein shall be understood to represent specific functions as described in more detail below and it should therefore be understood by those skilled in the art that the respective functions may also be organized in a different way, e.g., in fewer or more than 3 modules or under a different terminology. For example, in certain instances, some of the functions described herein under module 2 may be included in module 1 or 3 or vice versa, or at least one of the modules or several functions thereof may be incorporated in another module; e.g., module 2 per se may be part of module 1 etc. Other hierarchies or organizations of the described functions are therefore included in the invention.

Module 1 may serve as a platform for information exchange between customer and service provider, to enter and store project-related and customer-specific information, and/or place an order. In some embodiments, module 1 provides at least one of the following features: means to enter and save customer information such as, e.g., contact data, shipping address, billing information, customer ID, discount options etc.; means to select and order items from menus or lists, means to enter and save customer project specifications, or means to enter a description of material provided by a customer. Module 1 may further include pricing information for catalogue products or customized projects. In many instances, products or services may be designed and constructed based upon anticipated customer needs and intended for sale, for example, as a “catalogue product”. However, in some instances the design and construction of deliverables such as synthetic genes will be customized according to individual specifications.

One example of how module 1 can be organized, for example, to provide customized services related to DNA binding molecules such as TALs is illustrated in FIG. 1B. In this example, customer information contains at least customer contact data and account-related information. The customer can enter project specifications such as a target sequence, a target organism and an effector function, and can add additional specifications such as, for example, excluded motifs, cloning requirements and the like. In certain instances the customer may provide material such as, for example, a plasmid or cells. In such case, the customer may be asked to enter material specifications to describe the provided material.
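The project specifications collected by module 1 in this example could be held in a simple record such as the one sketched below. This is an illustration only: the class name ProjectSpec, its field names and the validation rules are hypothetical and are not taken from FIG. 1B; the sketch assumes a DNA target sequence written with the letters A, C, G and T.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProjectSpec:
    """Hypothetical record of a customer's project specifications."""
    target_sequence: str
    target_organism: str
    effector_function: str
    excluded_motifs: List[str] = field(default_factory=list)
    cloning_requirements: Optional[str] = None
    provided_material: Optional[str] = None      # e.g., "plasmid" or "cells"
    material_description: Optional[str] = None

    def validate(self):
        """Return a list of human-readable problems (empty if none)."""
        problems = []
        seq = self.target_sequence.upper()
        if not seq or set(seq) - set("ACGT"):
            problems.append("target sequence must contain only A, C, G, T")
        if self.provided_material and not self.material_description:
            problems.append("provided material requires a description")
        return problems


if __name__ == "__main__":
    spec = ProjectSpec("TACGTTGACCA", "Homo sapiens", "nuclease",
                       excluded_motifs=["GAATTC"],
                       provided_material="plasmid")
    print(spec.validate())
```

The second rule mirrors the statement that a customer who provides material may be asked to enter material specifications.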

The information stored or exchanged via module 1 may further be used, analyzed and/or processed by module 2. Module 2 may at least contain information, components or means required for sequence and/or assembly design and may include or provide at least one of the following: a database or access to databases such as, e.g., codon usage tables, sequence motif database, vector database, restriction enzyme database, effector sequence database, parts and/or devices database, code rules, host specifications, etc.; sequence analysis and optimization tools, means to perform sequence fragmentation, means for oligonucleotide design, means for encrypting a watermark or information in a sequence, means to develop an assembly strategy. One example of how module 2 can be organized is illustrated in FIG. 1B. In some examples, module 2 is capable of generating an assembly strategy. For example the output of module 2 may contain an assembly strategy for a customized gene. In certain instances, the output may comprise an assembly strategy for a DNA binding molecule such as a TAL effector as defined in more detail elsewhere herein. If a synthetic gene is to be designed, module 2 may comprise means to optimize the sequence of said gene. For this purpose it may be required to use information from proprietary or open source databases such as, for example, codon usage tables, or programs identifying suitable restriction enzymes for production or identifying inhibitory sequence motifs that may be excluded or programs taking into account host-specific requirements. Module 2 may further include means to edit a sequence or decompose a sequence into smaller parts to allow for optimal synthesis, assembly and/or production of a molecule. It may further include means to design oligonucleotides for synthesis. In some embodiments, module 2 may include a multi-parameter software taking into account optimization and production requirements in parallel, such as, for example, a GENEOPTIMIZER® program as described in more detail elsewhere herein.
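Two of the module 2 functions named above, identifying motifs to be excluded and decomposing a sequence into smaller parts, can be sketched in a few lines. This is a simplified illustration and not the GENEOPTIMIZER® program: the function names are hypothetical, only the given strand is searched, and the fragment size and overlap are arbitrary example parameters. The example motifs are the recognition sequences of EcoRI (GAATTC) and of the type IIS enzyme BsaI (GGTCTC).

```python
def find_excluded_motifs(seq, motifs):
    """Map each motif found in `seq` to the 0-based positions where it
    starts (forward strand only, overlapping matches included)."""
    seq = seq.upper()
    found = {}
    for motif in motifs:
        positions = []
        pos = seq.find(motif)
        while pos != -1:
            positions.append(pos)
            pos = seq.find(motif, pos + 1)
        if positions:
            found[motif] = positions
    return found


def fragment(seq, size, overlap):
    """Decompose `seq` into fragments of at most `size` nucleotides, with
    consecutive fragments sharing `overlap` nucleotides."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    fragments = []
    for start in range(0, len(seq), step):
        fragments.append(seq[start:start + size])
        if start + size >= len(seq):
            break
    return fragments


if __name__ == "__main__":
    print(find_excluded_motifs("AAGAATTCAA", ["GAATTC", "GGTCTC"]))
    print(fragment("ACGTACGTAC", size=4, overlap=2))
```

A fuller tool would also search the reverse complement strand and weigh several requirements in parallel, as the multi-parameter approach mentioned in the text suggests.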

If a molecule representing a specific function is to be designed, the design may include databases, information or rules specific to said molecule function. For example, if a DNA binding molecule is to be designed, module 2 may include at least a binding code table allocating amino acids to specific nucleotides. In some embodiments, module 2 may be organized to include information or rules related to TAL design (e.g., a TAL designer tool). Such “TAL designer” tool may include at least one of the following: one or more (e.g., from about 2 to about 30, from about 4 to about 30, from about 8 to about 30, from about 10 to about 30, from about 15 to about 30, from about 5 to about 50, etc.) TAL code tables as described elsewhere herein, means to apply TAL code rules, means to identify and select TAL effector related sequences, parts, domains etc. from a library or database; means to generate a TAL effector construct sequence or parts thereof and means to generate an assembly strategy for said construct. In certain instances it may be advantageous to include specific motifs in a TAL sequence such as for example a specific compartment targeting signal sequence where a TAL is to be targeted to a defined compartment within a cell. The TAL designer may receive the relevant information from a respective database. It may further be useful to consider species-specific requirements. For example, a given compartment targeting signal may only be active in a limited number of species and may therefore be inappropriate for certain hosts. The TAL designer tool may therefore include means to analyze a TAL design or TAL sequence for host compatibility. In one embodiment, the TAL designer may be equipped with a tool that can generate a model of a protein or protein domain such as, e.g., a protein folding program providing a three-dimensional model of a folded protein or structural data of said protein. In some embodiments, the tool can generate a model of a protein in nucleic acid-bound and/or nucleic acid-free conformation. Such data could be used to evaluate the binding specificity of TALs, the stability of protein-DNA or protein-protein interaction and/or structural properties of engineered TAL repeats or TAL effector proteins. The tool may therefore include means to analyze these data and identify accessible and/or inaccessible domains or residues within a TAL effector. The results of such analysis may serve to indicate whether the engineered protein or domain would be suitable for a specific application. For example, if the protein model suggests that an effector domain is not sufficiently exposed or shows a constrained conformation, it may be required to include a flexible linker (e.g., a Gly-Ser linker) or insert, delete, modify, extend, truncate or shift certain sequence elements as, for example, spacer sequences between domains. The TAL designer may further comprise means to edit a modeled protein sequence (e.g., replace certain amino acid residues by others) to modify the structure, function, binding specificity or activity of a protein. The editing function may be provided as a separate program or may be incorporated in other programs, for example, as part of the protein modeling tool. Such function may allow for in silico analysis and modification of engineered TAL proteins resulting in an edited protein or amino acid sequence that can be back-translated into a nucleic acid sequence to obtain a template for synthesis. Features specific for TAL design or incorporated in the TAL designer may be linked with those features relevant for general gene design. For example, the TAL designer may also access database information related to sequence optimization.

In one aspect of the invention, module 2 may further include a tool capable of designing a DNA binding molecule based on sequence information. The sequence information may, for example, comprise a specific target site, such as a TAL binding site, and may be obtained from a customer or some other source such as a database or the literature. In one aspect of the invention, such a tool may provide at least one of the following: (i) means to analyze sequence information, (ii) means to access database information and select items or rules therefrom, (iii) means to translate the rules into a protein design, (iv) means to back-translate the protein sequence into a nucleic acid sequence and/or (v) means to feed the information into a production system. The function of sequence dependent design of a binding entity may be provided as a separate program or may be incorporated in other programs, for example, as part of the TAL designer.
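By way of illustration only, the translation of a target site into a binding domain design (items (i) to (iii) above) might be sketched as follows. The sketch assumes the commonly cited RVD preferences (NI for A, HD for C, NN for G, NG for T); this mapping and the names `RVD_FOR_BASE` and `design_rvds` are assumptions made for the example and do not reproduce the TAL code tables described elsewhere herein.

```python
# Commonly cited RVD-to-nucleotide preferences. This mapping is an assumption
# for illustration; it is not one of the TAL code tables of this document.
RVD_FOR_BASE = {"A": "NI", "C": "HD", "G": "NN", "T": "NG"}


def design_rvds(target_site: str) -> list[str]:
    """Return one RVD per nucleotide of the target site, in 5'->3' order."""
    site = target_site.upper()
    unknown = set(site) - set(RVD_FOR_BASE)
    if unknown:
        raise ValueError(f"unsupported nucleotides: {sorted(unknown)}")
    return [RVD_FOR_BASE[base] for base in site]


if __name__ == "__main__":
    # Hypothetical 7-nucleotide target site.
    print("-".join(design_rvds("TCCAGAT")))  # NG-HD-HD-NI-NN-NI-NG
```

The resulting ordered list of RVDs could then be handed to a back-translation or assembly-planning step; those steps are outside the scope of this sketch.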

In one aspect, the invention relates to a TAL designer tool which is capable of generating an assembly strategy for a two-step TAL assembly process and which (i) provides access to a TAL repeat database and selects the required monomer building blocks; (ii) provides access to a vector database and selects an effector sequence in combination with a target vector sequence; (iii) defines the required triplets by allocation of the respective positions on a carrier (e.g., a 96 well plate); (iv) determines the assembly strategy of the capture vectors (allocation of the required triplets, wherein the terminal overhangs define their position within the capture vectors); (v) generates the complete nucleic acid sequence of the TAL repeat domain; (vi) generates the nucleic acid sequences of both capture vectors; (vii) generates the nucleic acid sequence of the TAL effector open reading frame (ATG to Stop); (viii) allows for importation of the sequences generated in one or more of steps (v) to (vii) into a database controlling a production process; and (ix) optionally generates a .gb file of the final TAL functional vector sequence. The steps may be performed in the given order or may be performed in a different order. In one aspect, the TAL designer tool may be provided as an Excel-based program.
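Steps (iii) and (iv) above can be pictured with a minimal sketch. It assumes, purely for illustration, that the triplet library holds the 64 possible base triplets in alphabetical order in the first 64 wells of a 96 well plate, and that each capture vector receives four triplets. The layout, the function name `assembly_plan` and the parameter `triplets_per_vector` are hypothetical. The sketch records only the slot of each triplet and does not model the terminal overhangs that define position in an actual assembly.

```python
from itertools import product
from string import ascii_uppercase

# 96 well plate coordinates: rows A-H, columns 1-12, row by row.
WELLS = [f"{row}{col}" for row in ascii_uppercase[:8] for col in range(1, 13)]

# Hypothetical library layout: 64 base triplets, alphabetical, wells A1 to F4.
TRIPLET_WELL = {
    "".join(t): WELLS[i] for i, t in enumerate(product("ACGT", repeat=3))
}


def assembly_plan(target: str, triplets_per_vector: int = 4) -> list[dict]:
    """Split a target site into triplets and assign each to a capture vector slot."""
    target = target.upper()
    if len(target) % 3:
        raise ValueError("target length must be a multiple of 3 in this sketch")
    triplets = [target[i:i + 3] for i in range(0, len(target), 3)]
    plan = []
    for idx, trip in enumerate(triplets):
        vector, slot = divmod(idx, triplets_per_vector)
        plan.append({
            "triplet": trip,
            "well": TRIPLET_WELL[trip],    # source position on the carrier
            "capture_vector": vector + 1,  # 1-based capture vector number
            "slot": slot + 1,              # 1-based position within that vector
        })
    return plan


if __name__ == "__main__":
    # Hypothetical 24-nucleotide target: 8 triplets over two capture vectors.
    for entry in assembly_plan("ACGTTTAAAACGTTTAAAACGTTT"):
        print(entry)
```

A production tool would additionally select the position-specific building blocks and emit the vector sequences of steps (v) to (vii); the pick list above covers only the allocation step.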

In some instances, module 2 may also provide means to transform individual steps of a working process into pricing information. For example, the final pricing of a customized project may depend on the number and/or complexity of steps required to produce a deliverable, the time to perform the service, and the costs of material, reagents or equipment used to perform the services. For such purpose, module 2 may, for example, include means to process information from lists of standardized items or stock keeping units (SKUs). In some instances, some of the information and/or results obtained from module 2 may be redirected to module 1. For example, pricing information related to a customer project generated in the context of project design may become accessible through the web interface of module 1. In some instances, information and/or results generated by the means and methods described in module 2 may be fed into or retrieved by a third module such as a manufacture or production unit. In some embodiments, the sequence information and/or design and/or assembly strategy generated in module 2 can be translated into a production workflow operated by module 3. Thus, the invention includes processing of results or information obtained from module 2 by a manufacture or production unit.

Module 3 may contain at least one of the following: (i) means to synthesize nucleic acid molecules, (ii) means to assemble nucleic acid molecules, (iii) means to clone or transfer nucleic acid molecules, (iv) one or more material repositories, (v) means to sequence nucleic acid molecules, (vi) means to cultivate, propagate, and/or manipulate cells, (vii) means to analyze data, (viii) means to store biological material, and/or (ix) a laboratory information management system. It is to be understood that the aforementioned means or the process steps performed by a production or manufacture unit can be performed in a different order or can be separated between different sub-modules or production entities which may be controlled or regulated together, separately or sequentially or may be interconnected. For example, means to synthesize or assemble nucleic acid molecules may be temporally and/or spatially separated from means to manipulate cells.

In some embodiments, module 3 contains means to synthesize nucleic acid molecules. Synthesis of nucleic acid molecules is usually based on a combination of organic chemistry and molecular biological techniques. In one aspect, nucleic acid molecules such as genes, gene fragments, parts, vectors, plasmids, domains, variants, libraries etc. may be synthesized “de novo”, without the need for a template such as, e.g., a given DNA template. De novo synthesis may, for example, include chemical synthesis of oligonucleotides which can be combined and assembled to obtain larger nucleic acid molecules, as, for example, described in Example 3. In another aspect, nucleic acid molecules may be obtained by template-dependent methods known in the art such as, for example, by PCR amplification, mutagenesis, recombination or the like. In yet another aspect, pre-synthesized parts may be combined and connected to obtain novel nucleic acid molecules. For example, nucleic acid parts or building blocks may be taken from a library or material repository. In some embodiments at least one step in the synthesis or assembly process may be conducted on a solid support or solid phase or in a microfluidic environment. Gene synthesis services used in the method of the invention can relate to any of the above described approaches or combinations thereof. In another aspect, one or more of the synthesis or assembly steps may be performed on solid supports or in solution as required. In yet another aspect, de novo synthesized nucleic acid molecules may be combined with template-derived nucleic acid molecules or may be combined with already available or pre-synthesized parts.

FIG. 1B illustrates an example of how module 3 (composed of different sub-modules) may be organized to allow for manufacture of DNA binding molecules such as TALs. In one aspect, module 3 may contain a “GeneAssembler” module that coordinates nucleic acid synthesis and assembly according to the assembly strategy developed in module 2. A GeneAssembler module may have access to one or more material repositories which may contain, for example, reagents for gene synthesis, standardized parts, vectors, plasmids, nucleic acid libraries, cloning tools, enzymes and/or enzyme cocktails and, where required, means to store customer material and/or synthesis or assembly derivatives. In some embodiments, the at least one material repository contains material related to TAL assembly such as TAL libraries or repeat domain building blocks (e.g., monomers, dimers, trimers etc.), effector domains (e.g., nucleases, activators, repressors, (de)methylases, (de)acetylases etc.) or variants thereof (e.g., mutated or truncated), TAL cloning or expression vectors etc. TAL-related tools or parts as described elsewhere herein may be taken from a repository or may be synthesized de novo and may be combined and/or assembled with other parts obtained from a material repository or synthesized de novo or provided by the customer. For example, a TAL repeat domain may be assembled from a TAL trimer library, may be combined with a de novo synthesized effector domain and may be cloned into a vector provided by the customer. However, all other combinations of available and de novo synthesized parts are possible and included in the concept of the invention. In certain instances, de novo synthesized parts or parts obtainable from a repository may be combined to produce a novel catalogue product, such as a vector (e.g., a TAL GATEWAY® vector) or a composition. In yet another embodiment, the service provider may develop an assembly strategy for a customer and compose a customized toolkit from repository parts and provide it to the customer for assembly (see, e.g., FIG. 7B).

A GeneAssembler module as used herein may employ different assembly tools and strategies and may incorporate in vitro and/or in vivo assembly approaches. For example, assembly may be performed using the inventive methods, compositions and/or tools described elsewhere herein. In some embodiments, a GeneAssembler module may employ at least one of the following assembly strategies: type II conventional cloning, type IIS-mediated or “Golden Gate” cloning (see, e.g., Engler, C., R. Kandzia, and S. Marillonnet. A one pot, one step, precision cloning method with high throughput capability. PLoS One 3:e3647 (2008); Kotera, I., and T. Nagai. A high-throughput and single-tube recombination of crude PCR products using a DNA polymerase inhibitor and type IIS restriction enzyme. J Biotechnol 137:1-7 (2008); Weber, E., R. Gruetzner, S. Werner, C. Engler, and S. Marillonnet. Assembly of Designer TAL Effectors by Golden Gate Cloning. PLoS One 6:e19722 (2011)), GATEWAY® recombination, TOPO® cloning, exonuclease-mediated assembly (Aslanidis and de Jong, Ligation-independent cloning of PCR products (LIC-PCR); Nucleic Acids Research, Vol. 18, No. 20, 6069 (1990)), homologous recombination, non-homologous end joining or a combination thereof. Modular type IIS based assembly strategies are, e.g., disclosed in PCT Publication WO 2011/154147, the disclosure of which is incorporated herein by reference. A GeneAssembler module may further comprise means for error correction of nucleic acid molecules.

Error correction can be performed either prior to assembly, between assembly steps or after assembly as required. One issue associated with nucleic acid synthesis, including chemical synthesis of nucleic acids, is that synthesized nucleic acids occasionally contain an incorrect base.

Consider the following hypothetical. Nucleic acid molecules are generated with one error in every 100 nucleotides and a nucleic acid molecule of 2000 nucleotides is assembled. This means that there will be, on average, 20 errors per molecule. Errors in protein coding regions can result in frame shifts, amino acid substitutions, or premature stop codons. In order to obtain a coding sequence which encodes a specified amino acid sequence, two options are: (1) sequence a large number of nucleic acid molecules to identify ones without errors, or (2) correct errors, then confirm the sequence of a smaller number of molecules.
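The arithmetic of this hypothetical can be made explicit with a short sketch. It assumes, as a simplification not stated in the text, that errors occur independently at each position with the same probability; the function names are illustrative only.

```python
import math


def expected_errors(length_nt: int, error_rate: float) -> float:
    """Mean number of errors in a molecule of the given length."""
    return length_nt * error_rate


def fraction_error_free(length_nt: int, error_rate: float) -> float:
    """Probability that a molecule carries no error (independent errors assumed)."""
    return (1.0 - error_rate) ** length_nt


def clones_to_screen(length_nt: int, error_rate: float,
                     confidence: float = 0.95) -> float:
    """Clones to sequence to find at least one correct clone with the given confidence."""
    p = fraction_error_free(length_nt, error_rate)
    if p >= 1.0:
        return 1
    if p <= 0.0:
        return math.inf
    # Solve 1 - (1 - p)^n >= confidence for n; log1p keeps precision for tiny p.
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p))


if __name__ == "__main__":
    # The hypothetical from the text: 1 error per 100 nt, 2000 nt molecule.
    print(expected_errors(2000, 1 / 100))      # about 20 errors per molecule
    print(fraction_error_free(2000, 1 / 100))  # on the order of 2e-9
    print(clones_to_screen(2000, 1 / 100))     # impractically many clones
```

Under these assumptions essentially no molecule of the hypothetical is error free, which is why option (1) alone becomes impractical at such error rates and option (2) is attractive.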

Error correction can be performed by any number of methods. Some such methods employ DNA binding enzymes which are capable of recognizing sequence errors or mismatches. For example, error correction methods may be based on mismatch endonucleases known in the art (e.g., MutS, Cel1, Res1, Vsr, or Perkinsus marinus nuclease PA3, T4 endonuclease VII or T7 endonuclease I).

Another method of error correction is set out in the following workflow. In the first step, nucleic acid molecules of a length smaller than that of the full-length desired nucleotide sequence (i.e., “nucleic acid molecule fragments” of the full-length desired nucleotide sequence) are obtained. Each nucleic acid molecule is intended to have a desired nucleotide sequence that comprises a part of the full-length desired nucleotide sequence. Each nucleic acid molecule may also be intended to have a desired nucleotide sequence that comprises an adapter primer for PCR amplification of the nucleic acid molecule, a tethering sequence for attachment of the nucleic acid molecule to a DNA microchip, or any other nucleotide sequence determined by any experimental purpose or other intention. The nucleic acid molecules may be obtained in any of one or more ways, for example, through synthesis, purchase, etc.

In the optional second step, the nucleic acid molecules are amplified to obtain more of each nucleic acid molecule. The amplification may be accomplished by any method, for example, by PCR. Introduction of additional errors into the nucleotide sequences of any of the nucleic acid molecules may occur during amplification.

In the third step, the amplified nucleic acid molecules are assembled into a first set of molecules intended to have a desired length, which may be the intended full length of the desired nucleotide sequence. Assembly of amplified nucleic acid molecules into full-length molecules may be accomplished in any way, for example, by using a PCR-based method.

In the fourth step, the first set of full-length molecules is denatured. Denaturation renders single-stranded molecules from double-stranded molecules. Denaturation may be accomplished by any means. In some embodiments, denaturation is accomplished by heating the molecules.

In the fifth step, the denatured molecules are annealed. Annealing renders a second set of full-length, double-stranded molecules from single-stranded molecules. Annealing may be accomplished by any means. In some embodiments, annealing is accomplished by cooling the molecules.

In the sixth step, the second set of full-length molecules is reacted with one or more endonucleases to yield a third set of molecules intended to have lengths less than the length of the complete desired gene sequence. The endonucleases cut one or more of the molecules in the second set into shorter molecules. The cuts may be accomplished by any means. Cuts at the sites of any nucleotide sequence errors are particularly desirable, in that assembly of pieces of one or more molecules that have been cut at error sites offers the possibility of removal of the cut errors in the final step of the process. In an exemplary embodiment, the molecules are cut with T7 endonuclease I, E. coli endonuclease V, and Mung Bean endonuclease in the presence of manganese. In this embodiment, the endonucleases are intended to introduce blunt cuts in the molecules at the sites of any sequence errors, as well as at random sites where there is no sequence error.

In the last step, the third set of molecules is assembled into a fourth set of molecules, whose length is intended to be the full length of the desired nucleotide sequence. Because of the late-stage error correction enabled by the provided method, the fourth set of molecules is expected to have many fewer nucleotide sequence errors than can be provided by methods in the prior art.

The process set out above is also set out in U.S. Pat. No. 7,704,690, the disclosure of which is incorporated herein by reference.

Another process for effectuating error correction in chemically synthesized nucleic acid molecules is a commercial process referred to as ERRASE™ (Novici Biotech). Error correction methods and reagents suitable for use in error correction processes are set out in U.S. Pat. Nos. 7,838,210 and 7,833,759, U.S. Patent Publication No. 2008/0145913 A1 (mismatch endonucleases), and PCT Publication WO 2011/102802 A1, the disclosures of which are incorporated herein by reference.

Exemplary mismatch endonucleases include endonuclease VII (encoded by the T4 gene 49), T7 endonuclease I, Res1 endonuclease, Cel1 endonuclease, and SP endonuclease or methyl-directed endonucleases such as MutH, MutS or MutL. The skilled person will recognize that other methods of error correction may be practiced in certain embodiments of the invention such as those described, for example, in U.S. Patent Publication Nos. 2006/0127920 AA, 2007/0231805 AA, 2010/0216648 A1, 2011/0124049 A1 or U.S. Pat. No. 7,820,412, the disclosures of which are incorporated herein by reference.

Another schematic of an error correction method is shown in FIG. 7.

Synthetically generated nucleic acid molecules typically have an error rate of about 1 base in 300-500 bases. Further, in many instances, greater than 80% of errors are single base frameshift deletions and insertions. Also, less than 2% of errors result from the action of polymerases when high fidelity PCR amplification is employed. In many instances, mismatch endonuclease (MME) correction will be performed using a fixed protein:DNA ratio.

In another embodiment, error correction may be performed indirectly, e.g., by selecting correct nucleic acid molecules or eliminating incorrect nucleic acid molecules from a mixture or library of nucleic acid molecules. In one aspect the correction may include negative selection of frameshift mutations and may, for example, employ frame-dependent reporter expression to identify correct constructs such as, e.g., disclosed in U.S. Patent Publication No. 2010/0297642 AA, the disclosure of which is incorporated herein by reference. A GeneAssembler module may further contain sequencing means to determine the sequence of synthesized or assembled nucleic acid molecules. Sequencing may be applied to fragments and/or full-length genes. A GeneAssembler module should be equipped with all devices required to perform the described workflows including reagents and material (e.g., chemicals, enzymes, solvents, media, cells, consumables etc.), machines (e.g., oligonucleotide synthesizer, PCR-cycler, sequencer, incubator, clone picker, HPLC) and/or computer programs and analysis tools.

In one aspect of the invention, protein expression may be performed by the service provider as part of the service. In another aspect, the customer may order a construct and an expression kit and the expression may be performed by the customer. In cases where the customer requests expression or protein services, module 3 may further contain an “Express” module. Where protein services are directed to TALs, a respective TAL-Express module may be provided which may at least include means for delivery of TAL constructs, means for TAL expression, means to cultivate and manipulate TAL host cells, means for protein extraction or purification and/or reporter systems. In some embodiments, TAL-Express offers different vectors or delivery systems to transfect host cells or target TALs to specific compartments. In particular, TAL-Express may employ the delivery systems or expression systems as described elsewhere herein. Furthermore, a TAL-Express module may include different expression systems or host cells such as bacteria, algae, yeast, fungi, plant, mammalian or human cells or cell cultures.

In cases where expression is performed by the customer, a TAL construct may be delivered together with an expression kit. Different expression systems or kits are known in the art and may be chosen from the service provider's order portal or catalogue such as bacterial expression strains or expression kits (e.g., BL21 STAR™ based CHAMPION™ pET Expression System from Life Technologies (Carlsbad, Calif.)), algae expression kits (e.g., GENEART® Chlamydomonas Engineering Kit, GENEART® Synechococcus Engineering Kit), or mammalian cell lines allowing for stable integration of an ordered construct and efficient expression from a transcriptionally active genomic locus (e.g., FLP-IN™ or Jump-In mammalian cells). In some instances, the customer may want to order delivery tools from the service provider to deliver an ordered construct into a certain cell type. For example, TAL constructs may be efficiently delivered to non-dividing or dividing mammalian cells by using an adenoviral-based expression system (e.g., ViraPower™ Adenoviral GATEWAY® Expression Kit offered by Life Technologies (Carlsbad, Calif.)).

In some instances, cell-free TAL expression may be employed. Cell-free protein production can be accomplished with several kinds and species of cell extracts such as E. coli lysates (e.g., Expressway™ Maxi Cell-Free E. coli Expression System), rabbit reticulocyte lysates (RRL), wheat germ extracts, insect cell (such as SF9 or SF21) lysates, or extracts with human translation machinery. For such purpose, the service provider may offer a selection of cell-free expression kits to be ordered together with the gene synthesis service. However, in certain embodiments, cell-free expression may also be employed by the service provider in the context of protein services.

In another aspect, module 3 may contain means to analyze the function, structure or correctness of deliverables or manufacture intermediates. Respective analyses may be routinely performed by the service provider for quality control (QC) purposes. For example, where a synthetic gene has been manufactured for a customer, QC analysis would at least include evaluation of sequence correctness, e.g., by sequencing of said gene. In certain instances, where TAL services are offered, module 3 may include a “TAL Analyzer” module that performs additional experiments or analyses to validate the manufactured products. A TAL Analyzer may, e.g., include reporter assays to analyze TAL repeat integrity, TAL binding specificity, TAL function, TAL structure, TAL activity, effector activity, TAL expression etc. In particular, a TAL Analyzer may employ the reporter assays and analysis tools as described elsewhere herein. Different options for reporter-based analysis of TAL constructs may be provided. In a first embodiment, a reporter kit may be provided as a catalogue product and may be ordered by the customer together with TAL services. In another embodiment, a reporter-based analysis of TAL function etc. may be offered as an extra service. In such case, the reporter assay or analysis would be performed by the service provider and the customer would obtain the results of the assay. In a third embodiment, the customer may order a customized reporter assay for TAL analysis developed by the service provider. Different options may be combined and offered for selection in the order portal.

Optionally, some or all of the steps or workflows summarized in module 3 may be controlled or interconnected by a software-based Laboratory Information Management System (LIMS) that offers features to support laboratory operations. Such features may include workflow and data tracking and may provide data exchange interfaces connecting workflows of different modules or production steps. A LIMS may further integrate data mining or assay data management and may provide numerous software functions such as, e.g., the reception and log in of a sample and its associated customer data; the assignment, scheduling, and tracking of the sample (e.g., via a barcode) and the associated workload; the processing and quality control associated with the sample and the utilized equipment; the storage of data associated with the sample; and/or the inspection, approval, and compilation of the sample data for reporting and/or further analysis.

Deliverables resulting from the methods and processes summarized in module 3 will be shipped or transferred to the customer. Deliverables may include material such as nucleic acid molecules, proteins, cells, kits or compositions. Deliverables may further include data such as sequence information, service reports, assay results or QC documents which may either be shipped together with material, separately or may be provided, e.g., via email or a web interface (e.g., the interface of module 1).

TAL Effector Sequence Design

The methods and compositions described herein can be applied to any modular DNA binding effector molecule but may be particularly useful for engineered TAL effector systems. In one aspect, the invention relates to the generation of engineered TAL effectors with improved nucleic acid binding cassettes wherein the cassettes have been optimized for (i) increased expression in a target host and/or (ii) increased specificity for a defined target sequence. In another aspect, the invention relates to the generation of engineered effector fusions wherein the effector fusions can be optimized for (i) increased expression in a target host and/or (ii) increased activity towards a defined target sequence. Thus, in one aspect, the invention includes methods of designing TAL effector proteins and TAL effector coding nucleic acid sequences for optimal performance in downstream applications.

In certain embodiments of the invention, the selected TAL effector nucleic acid sequence or a portion thereof may be subject to a sequence optimization process prior to synthesis. The optimization process can be directed to the nucleic acid sequence encoding the TAL binding domain or the nucleic acid sequence encoding the TAL effector fusion or can include sequence optimization of both moieties and, if applicable, can include optimization of additional spacer, adapter, linker or tag sequences contributing to the TAL effector entity. The optimization of different parts of the TAL effector nucleic acid sequence can occur either sequentially or simultaneously. Different computational approaches for sequence modification are known in the art and may be employed to optimize a given nucleotide sequence in terms of (1) efficient assembly and/or (2) improved performance in a given host.

To design a nucleotide sequence for optimal assembly, a full-length sequence may be broken down into a defined number of smaller fragments with optimal hybridization properties by means of an algorithm taking into account parameters such as melting temperature, overlap regions, self-hybridization, absence or presence of cloning sites and the like. In certain aspects of the invention, it may be desired to use an optimization strategy that takes into account multiple different parameters simultaneously, including assembly-related as well as expression-related sequence properties. Algorithms for designing codon-optimized coding sequences are known in the art. One example of a comprehensive multiparameter approach that may be used in the current invention for optimized sequence design is the GENEOPTIMIZER® technology described in U.S. Patent Publication No. 2007/0141557 AA, the disclosure of which is incorporated herein by reference.
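The fragmentation step can be illustrated with a deliberately simple sketch. It splits a sequence into fixed-length fragments with fixed overlaps and estimates the melting temperature of each overlap with the Wallace rule (2 °C per A/T, 4 °C per G/C), a crude approximation intended for short oligonucleotides. The fixed lengths, the function names and the use of the Wallace rule are assumptions for this example; the sketch is not the GENEOPTIMIZER® algorithm and ignores self-hybridization and cloning sites.

```python
def wallace_tm(seq: str) -> int:
    """Rough melting temperature in deg C using the Wallace rule."""
    s = seq.upper()
    return 2 * (s.count("A") + s.count("T")) + 4 * (s.count("G") + s.count("C"))


def split_with_overlaps(seq: str, frag_len: int = 60, overlap: int = 20) -> list[str]:
    """Split seq into fragments of at most frag_len that share `overlap` bases."""
    if not 0 < overlap < frag_len:
        raise ValueError("overlap must be positive and smaller than frag_len")
    step = frag_len - overlap
    fragments, start = [], 0
    while True:
        fragments.append(seq[start:start + frag_len])
        if start + frag_len >= len(seq):
            break
        start += step
    return fragments


def overlap_report(fragments: list[str], overlap: int = 20) -> list[tuple[str, int]]:
    """Return (overlap sequence, estimated Tm) for each adjacent fragment pair."""
    return [(f[-overlap:], wallace_tm(f[-overlap:])) for f in fragments[:-1]]


if __name__ == "__main__":
    demo = "ACGT" * 40  # hypothetical 160-nt sequence
    frags = split_with_overlaps(demo)
    for ov, tm in overlap_report(frags):
        print(ov, tm)
```

A fuller design tool would vary fragment boundaries until all overlap melting temperatures fall within a narrow window, rather than merely reporting them as done here.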

In certain embodiments of the invention, it may be desirable to optimize the TAL effector nucleic acid sequence for improved performance in a given homologous or heterologous host, to improve, e.g., expression yield, activity or solubility. In this context, codon optimization has proven to be an efficient tool to increase expression yields in many different species such as, e.g., plants including algae, bacteria, yeast, insect cells or mammalian cells (such as human cells), etc. Codon optimization means replacing codons with synonymous codons, wherein the term “synonymous codon” as used herein refers to a codon having a different nucleotide sequence than another codon but encoding the same amino acid as that other codon. The codon usage of a given gene or gene fragment may, e.g., be adapted to the codon choice of the organism in which it shall be expressed. The codon usage can vary significantly for different expression systems including the most widely used viral (retro- and lentiviral, AAV, Adeno, Baculo, Sindbis, Vaccinia), bacterial (e.g., E. coli, B. subtilis, L. lactis), yeast (e.g., S. cerevisiae, S. pombe, P. pastoris), fungal (e.g., A. niger, A. oryzae, A. awamori, Fusarium, Trichoderma sp., Penicillium sp.), insect (e.g., Spodoptera frugiperda Sf9, Sf21, Drosophila melanogaster S2; Trichoplusia ni High Five™), plant (e.g., Agrobacterium tumefaciens, Nicotiana tabacum), algae (e.g., P. tricornutum, C. reinhardtii, Synechococcus elongatus, Chlorella vulgaris), mammalian (e.g., CHO, 3T3 cells) or human (e.g., H1299, 293, PER.C6 cells) expression systems. Genomic codon usage tables for various species are available in the codon usage database at http://www.kazusa.or.jp/codon/ including codon usage tables for chloroplasts and mitochondria. Two exemplary codon usage tables reflecting the genomic codon usage of C. reinhardtii (TABLE 2) and the chloroplast codon usage of C. reinhardtii (TABLE 3) are shown below:

TABLE 2 Genomic Codon Usage of C. reinhardtii
Fields: Triplet - Frequency: per Thousand - (Number)

UUU  5.0 (2110)    UCU  4.7 (1992)    UAU  2.6 (1085)    UGU  1.4 (601)
UUC 27.1 (11411)   UCC 16.1 (6782)    UAC 22.8 (9579)    UGC 13.1 (5498)
UUA  0.6 (247)     UCA  3.2 (1348)    UAA  1.0 (441)     UGA  0.5 (227)
UUG  4.0 (1673)    UCG 16.1 (6763)    UAG  0.4 (183)     UGG 13.2 (5559)
CUU  4.4 (1869)    CCU  8.1 (3416)    CAU  2.2 (919)     CGU  4.9 (2071)
CUC 13.0 (5480)    CCC 29.5 (12409)   CAC 17.2 (7252)    CGC 34.9 (14676)
CUA  2.6 (1086)    CCA  5.1 (2124)    CAA  4.2 (1780)    CGA  2.0 (841)
CUG 65.2 (27420)   CCG 20.7 (8684)    CAG 36.3 (15283)   CGG 11.2 (4711)
AUU  8.0 (3360)    ACU  5.2 (2171)    AAU  2.8 (1157)    AGU  2.6 (1089)
AUC 26.6 (11200)   ACC 27.7 (11663)   AAC 28.5 (11977)   AGC 22.8 (9590)
AUA  1.1 (443)     ACA  4.1 (1713)    AAA  2.4 (1028)    AGA  0.7 (287)
AUG 25.7 (10796)   ACG 15.9 (6684)    AAG 43.3 (18212)   AGG  2.7 (1150)
GUU  5.1 (2158)    GCU 16.7 (7030)    GAU  6.7 (2805)    GGU  9.5 (3984)
GUC 15.4 (6496)    GCC 54.6 (22960)   GAC 41.7 (17519)   GGC 62.0 (26064)
GUA  2.0 (857)     GCA 10.6 (4467)    GAA  2.8 (1172)    GGA  5.0 (2084)
GUG 46.5 (19558)   GCG 44.4 (18688)   GAG 53.5 (22486)   GGG  9.7 (4087)

TABLE 3 Chloroplast Codon Usage of C. reinhardtii
Fields: Triplet - Frequency: per Thousand - (Number)

UUU 33.4 (894)     UCU 17.0 (455)     UAU 24.6 (657)     UGU  7.6 (203)
UUC 17.1 (456)     UCC  2.8 (74)      UAC 10.0 (266)     UGC  1.5 (39)
UUA 77.7 (2078)    UCA 22.0 (588)     UAA  2.9 (78)      UGA  0.1 (3)
UUG  4.3 (114)     UCG  4.0 (107)     UAG  0.4 (12)      UGG 13.5 (361)
CUU 14.3 (383)     CCU 15.5 (414)     CAU 10.1 (270)     CGU 32.4 (866)
CUC  1.0 (28)      CCC  3.4 (90)      CAC  8.8 (235)     CGC  4.1 (110)
CUA  6.4 (170)     CCA 23.6 (630)     CAA 38.4 (1026)    CGA  3.4 (90)
CUG  3.7 (99)      CCG  2.4 (63)      CAG  4.1 (110)     CGG  0.5 (14)
AUU 51.4 (1374)    ACU 24.4 (651)     AAU 42.1 (1126)    AGU 16.0 (428)
AUC  8.2 (219)     ACC  5.1 (135)     AAC 17.7 (472)     AGC  5.4 (144)
AUA  6.9 (184)     ACA 32.4 (865)     AAA 69.1 (1847)    AGA  5.3 (143)
AUG 22.3 (596)     ACG  3.9 (103)     AAG  6.2 (167)     AGG  0.9 (23)
GUU 29.3 (783)     GCU 34.0 (908)     GAU 25.3 (676)     GGU 44.0 (1177)
GUC  2.5 (68)      GCC  5.9 (159)     GAC  9.8 (263)     GGC  6.4 (172)
GUA 26.0 (696)     GCA 20.7 (554)     GAA 41.1 (1098)    GGA  8.6 (229)
GUG  5.6 (149)     GCG  3.3 (88)      GAG  5.7 (152)     GGG  3.7 (99)

Thus, in one aspect, the invention relates to optimized TAL effector expression constructs and methods to achieve the best possible design for a given target host. An increase in gene expression may be achieved, for example, by replacing non-preferred or less preferred codons by more preferred codons or non-preferred codons by more preferred and less-preferred codons with regard to a specific host system, thereby taking advantage of the degenerate genetic code without modifying the encoded amino acid sequence. Methods of producing synthetic genes with improved codon usage are, e.g., described in U.S. Pat. Nos. 6,114,148 and 5,786,464, the disclosures of which are incorporated herein by reference. Alternatively, it may be sufficient to only modify or randomize the initial 5′ codons of a given sequence or open reading frame as, e.g., described in WO2009/113794. In another embodiment, the codon adaptation strategy may be such as to modify codons that are over- or underrepresented in genomic sequences, eliminate only random codons or certain motifs (such as, e.g., AGG in viral sequences) and harmonize the distribution of other codons over the entire sequence. For example, the GC content may be harmonized to allow for correct folding of complex, modular or repetitive protein motifs. Also, a combination of different optimization strategies may be ideal to achieve the best effect for a given TAL or TAL effector sequence. In some methods of the invention, codon optimization can be applied to (i) the TAL cassettes and/or (ii) TAL repeats as a whole, and/or (iii) the N- and C-terminal flanking regions and/or (iv) effector fusion encoding sequences. However, in certain aspects it may in addition be useful to optimize other upstream or downstream located sequences.
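The simplest of the strategies described above, replacing each codon by the most frequent synonymous codon of the host, might be sketched as follows. The frequencies are taken from TABLE 2 for a handful of amino acids only, and the names `USAGE`, `BEST` and `optimize` are illustrative. This "one amino acid, one codon" approach is shown only as an example and is not the multi-parameter optimization referred to elsewhere herein.

```python
# Per-thousand frequencies from TABLE 2 (C. reinhardtii, genomic) for a few
# amino acids only; a complete tool would load all 64 codons.
USAGE = {
    "L": {"UUA": 0.6, "UUG": 4.0, "CUU": 4.4, "CUC": 13.0, "CUA": 2.6, "CUG": 65.2},
    "K": {"AAA": 2.4, "AAG": 43.3},
    "E": {"GAA": 2.8, "GAG": 53.5},
    "G": {"GGU": 9.5, "GGC": 62.0, "GGA": 5.0, "GGG": 9.7},
    "M": {"AUG": 25.7},
}

# Reverse lookup (codon -> amino acid) and preferred codon per amino acid.
CODON_TO_AA = {codon: aa for aa, codons in USAGE.items() for codon in codons}
BEST = {aa: max(codons, key=codons.get) for aa, codons in USAGE.items()}


def optimize(coding_seq: str) -> str:
    """Replace each known codon by the host's most frequent synonymous codon."""
    rna = coding_seq.upper().replace("T", "U")
    if len(rna) % 3:
        raise ValueError("sequence length must be a multiple of 3")
    out = []
    for i in range(0, len(rna), 3):
        codon = rna[i:i + 3]
        aa = CODON_TO_AA.get(codon)
        # Codons outside the partial table are left unchanged in this sketch.
        out.append(BEST[aa] if aa else codon)
    return "".join(out)


if __name__ == "__main__":
    # Hypothetical input encoding Met-Leu-Lys-Glu-Gly with rarely used codons.
    print(optimize("AUGUUAAAAGAAGGA"))  # AUGCUGAAGGAGGGC
```

Because only synonymous codons are exchanged, the encoded amino acid sequence is unchanged. Choosing the least frequent codon instead of the most frequent one would give a correspondingly simple de-optimization.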

In some embodiments of the invention at least all sequences expressed in a target host have been subject to codon-optimization. In certain aspects of the invention it may, however, be useful to optimize or de-optimize only one or two of the above listed domains or only a proportion thereof. For example, in certain embodiments of the invention, one or more of the sequences to be expressed have been codon-optimized by at least 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 99% or more with regard to a given target host system. In one embodiment the host system is an algae system and one or more sequences to be expressed have been optimized based on the codon usage preferences listed in TABLES 2 or 3. In one embodiment a TAL effector sequence reflects the codon usage of one or more algal chloroplasts.

In another aspect of the invention, it may be useful to decrease the number of optimized codons in a given sequence as a means for lowering expression levels. For example, it may be useful to decrease expression level of a certain expression product as compared to another expression product in order to balance interaction or activity of both products. In another aspect, different optimization strategies can be applied, e.g., to different gene products expressed together in a host cell. For example, a functional vector comprising a TAL effector and more than one effector fusion sequence may be designed such that one of the effector sequences has been optimized for increased expression whereas another effector sequence has been de-optimized to limit expression levels. Optimization can thus be used to trigger a defined production ratio of expressed gene products, thereby modulating their activity. This strategy may for example be particularly useful where TALs are being used as scaffolds to arrange enzyme activities for a given biosynthetic pathway on a DNA template, as described in more detail elsewhere herein. In such cases, it may be required that different enzymes are subject to different codon optimization strategies to achieve different expression levels or a rational expression balance for the best possible interplay.

In another aspect of the invention, a multi-organism optimization or de-optimization approach may be applied as, e.g., codons may be selected to allow for (i) expression in more than one specific host organism or (ii) expression in one organism but not in another or (iii) a blend of optimized codons for improved expression in two or more organisms. For example, a codon choice may be used that allows for a TAL effector to be efficiently expressed in yeast and algae but not in E. coli or, in another example, a codon choice may be used to allow for expression in mammalian as well as insect systems. Thus, in some embodiments, the invention relates to engineered TAL effector sequences exhibiting a codon choice that is compatible with a first species and at least a second species wherein the TAL effector sequence can be expressed at detectable levels in the first and at least the second species. In another embodiment, the invention relates to an engineered TAL effector sequence exhibiting a codon choice that is compatible with a first species but is not compatible with at least a second species wherein the TAL effector sequence can be expressed at detectable levels in the first species but cannot be expressed at detectable levels in at least a second species.

Apart from translational effects, codon usage may influence multiple levels of RNA metabolism and has also been shown to influence transcriptional regulation. For example, the expression of a gene can be modulated by modifying the number of CpG dinucleotides in the open reading frame as described in U.S. Patent Publication No. 2009/0324546 AA, the disclosure of which is incorporated herein by reference. In this context it was demonstrated that an increase of the intragenic CpG content can further augment expression yields as compared to a “conventionally codon-optimized” gene, mainly by triggering de novo transcription rates, whereas a decrease of the intragenic CpG content has the contrary effect. Thus, in one aspect the invention relates to TAL effector sequences wherein at least (i) the TAL repeat domain and/or (ii) the N- and C-terminal flanking regions and/or (iii) effector encoding sequences comprise an increased CpG dinucleotide content to increase expression or a decreased intragenic CpG dinucleotide content to decrease expression.
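As a minimal sketch of how intragenic CpG content might be compared between two synonymous coding sequences, the following counts CpG dinucleotides in a DNA string. The function name count_cpg and the two nine-base example sequences are hypothetical; both examples encode Ala-Arg-Ser under the standard genetic code.

```python
def count_cpg(dna):
    """Count CpG dinucleotides (a C immediately followed by a G)."""
    # "CG" cannot overlap with itself, so str.count sees every occurrence.
    return dna.upper().count("CG")


# Two synonymous encodings of Ala-Arg-Ser (standard genetic code assumed).
cpg_rich = "GCGCGGTCG"  # GCG CGG TCG
cpg_poor = "GCCAGAAGC"  # GCC AGA AGC

assert count_cpg(cpg_rich) > count_cpg(cpg_poor)
```

A codon selection routine could use such a count as one scoring term alongside codon frequency, depending on whether increased or decreased expression is intended.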

The above described strategies may further be combined to modulate the immunogenicity of gene products. It may for example be desired to minimize the immunogenicity of a TAL effector for therapeutic application in a mammalian or human host. For example, PCT Publication WO 2009/049359 A1, the disclosure of which is included herein by reference, discloses methods of modulating the quality of an immune response to a target antigen in a mammal wherein the quality is modulated by replacing at least one codon of the polynucleotide with a synonymous codon that has a higher or lower preference of usage by the mammal to confer the immune response than the codon it replaces. The ranking of codons mediating increased expression is not necessarily identical with the ranking of codons mediating an increased immune response. Thus, in a further aspect of the invention, replacement by synonymous codons may be applied to change the immunogenicity of a heterologous DNA binding effector molecule such as zinc-finger nucleases, meganucleases or TAL effector molecules in a target host system by replacing codons with a higher ranking to confer an immune response by synonymous codons with a lower ranking to confer a lower immune response in a mammalian or human host.

Thus, the invention relates, in part, to DNA binding effector molecules optimized for increased expression in mammalian organisms while at the same time having decreased immunogenicity in said mammalian host. In one embodiment, the invention relates to a DNA binding effector sequence wherein at least one of (i) the DNA binding domain sequence or (ii) at least one effector domain sequence has been codon-optimized for expression in mammalian cells and wherein the codon-optimization takes into account selecting synonymous codons that have a lower immune response preference, wherein at least one codon may be replaced according to the following scheme to decrease immunogenicity of the optimized sequence: GCT by GCG or GCA or GCC; GCC by GCG or GCA; CGA by AGG or CGG; CGC by AGG or CGG; CGT by AGG or CGG; AGA by AGG or CGG; AAC by AAT; GAC by GAT; TGC by TGT; ATC by ATA or ATT; ATT by ATA; CTG by CTA or CTT or TTG or TTA; CTC by CTA or CTT or TTG or TTA; CTA by TTG or TTA; CTT by TTG or TTA; TTG by TTA; TTT by TTC; CCC by CCT; TCG by TCT or TCA or TCC or AGC or AGT; TCT by AGC or AGT; TCA by AGC or AGT; TCC by AGC or AGT; ACG by ACC or ACA or ACT; ACC by ACA or ACT; ACA by ACT; TAC by TAT; GAA by GAG; GGA by GGC or GGT or GGG; CCC by CCA or CCG; CCT by CCA or CCG; GTG by GTT or GTA; GTC by GTT or GTA; GTT by GTA.

In a DNA binding effector sequence optimized as described above, the DNA binding effector may be a zinc-finger nuclease, a TAL effector, a TAL epigenetic modifier or a meganuclease.

In specific embodiments, an optimized open reading frame may be combined with an algorithm to encrypt a secret message into the open reading frame as described in U.S. Patent Publication No. 2011/0119778 AA, the disclosure of which is incorporated herein by reference. Such a message may allow the identification or tracking of certain synthetic nucleic acid molecules encoding DNA binding effector molecules. In certain aspects of the invention the encrypted message may be included in the TAL effector sequence and may serve to identify transfected or genetically engineered cells such as mammalian cells, yeast cells, algae or microalgae or other engineered plants, plant seeds or crops. In some embodiments, the encrypted message is included in either the TAL binding domain or at least one effector domain encoding sequence. The message can be inserted without changing the amino acid sequence of the effector domain, making use of the degenerate genetic code as described, e.g., in U.S. Patent Publication No. 2011/0119778 AA, the disclosure of which is incorporated herein by reference. Thus, the invention also relates to DNA binding effector molecules such as zinc-finger nucleases, meganucleases, TAL effectors or TAL epigenetic modifier encoding sequences containing an encrypted message. Furthermore, the invention relates to TAL effector constructs wherein at least part of the TAL effector sequence has been codon-adapted to an algae, plant or mammalian expression system and the effector fusion harbors a secret message encrypted according to the method as described in U.S. Patent Publication No. 2011/0119778 AA.

Engineered TAL Effectors

The TAL code. Natural TAL effectors are usually composed of an amino-terminal moiety (N-terminus), a central array comprising multiple amino acid repeats with hypervariable RVDs that determine base preference, and a carboxyl-terminal portion (C-terminus) comprising a nuclear localization signal (NLS) and a transcription activator (AD) domain, the latter of which can be replaced by any effector domain. In many instances, the central amino acid repeats are between 32 and 35 amino acids in length, with amino acid variations at positions 12 and 13 determining base specificity of the particular repeat. Based on the modular TAL structure, the central repeats can be synthesized separately to be assembled into a given TAL (see FIG. 2A), which allows for efficient genetic engineering of TAL effectors with novel function.

As noted above, a distinctive characteristic of TAL effectors is a central repeat domain containing between 1.5 and 33.5 TAL cassettes that are usually around 34 residues in length (the C-terminal cassette is generally shorter and referred to as a “half repeat”). A typical sequence of a naturally occurring cassette is LTPEQVVAIASHDGGKQALETVQRLLPVLCQAHG (SEQ ID NO: 90), with hypervariable residues at positions 12 and 13.
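For illustration, the hypervariable residues of such a cassette can be read directly from positions 12 and 13. The helper name rvd_of is hypothetical, and the sketch assumes an ungapped repeat in which both RVD residues are present (it does not handle RVDs such as N*, where the second residue is missing).

```python
def rvd_of(repeat):
    """Return the residues at positions 12 and 13 (1-based) of a repeat."""
    if len(repeat) < 13:
        raise ValueError("repeat is too short to contain positions 12 and 13")
    return repeat[11:13]  # 0-based slice covering 1-based positions 12-13


cassette = "LTPEQVVAIASHDGGKQALETVQRLLPVLCQAHG"  # SEQ ID NO: 90
print(len(cassette), rvd_of(cassette))
```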

The amino acid sequences of TAL repeats can vary to some extent within the same protein and between proteins. An alignment of TAL repeats from proteins obtained from different bacteria is shown in FIG. 2B. The repeats contain several conserved regions. TAL repeats of the invention may contain one or more of the amino acid sequences set out in TABLE 4.

TABLE 4

TAL Amino Acid Sequence
LTPDQVVAIASN1GGKQALETVQRLLPVLCQAHG (SEQ ID NO: 91)
LTPDQVVAIAS21GGKQALETVQRLLPVLCQAHG (SEQ ID NO: 92)
LTP3QVVAIA4 (SEQ ID NO: 93)
GGK5AL6 (SEQ ID NO: 94)

Legend:
“1” is G, I, D, T, N, K, or no amino acid
“2” is I, N, H, or Y
“3” is D, A, E, Q, or N
“4” is S, A, or N
“5” is P or Q
“6” is E or G

The primary amino acid sequence of a TAL repeat domain dictates the nucleotide sequence to which it binds. The crystal structure of a TAL effector bound to DNA suggests that each TAL cassette comprises two alpha helices and a short RVD-containing loop where the second residue of the RVD at position 13 makes sequence-specific DNA contacts while the first residue of the RVD at position 12 stabilizes the RVD-containing loop (Deng, D. et al., Structural Basis for Sequence-Specific Recognition of DNA by TAL Effectors. Science 335 (6069): 720-723 (2012)). Target sites of TAL effectors also tend to include a T flanking the 5′ base targeted by the first repeat and this appears to be due to a contact between this T and a conserved tryptophan in the region N-terminal of the central cassettes. Because of the specific relationship between the TAL amino acid sequence and the target binding site, target sites can be predicted for TAL effectors, and TAL effectors also can be engineered and generated for the purpose of binding to particular nucleotide sequences.

TAL effectors have been shown to bind to DNA duplexes as well as DNA-RNA hybrids, wherein binding is in each case believed to be determined by the DNA forward strand (Yin et al. Specific DNA-RNA Hybrid Recognition by TAL Effectors. Cell Reports 2, 707-713 (2012)). Therefore, as used herein a target site or TAL binding site can be provided in the context of a DNA double strand or a DNA-RNA hybrid.

Thus, the invention relates, in part, to TAL effectors wherein each TAL nucleic acid binding cassette is responsible for recognizing one base pair in the target DNA sequence (wherein the target DNA sequence may be provided in the context of a double stranded DNA or a DNA-RNA hybrid molecule), and wherein the RVD comprises one or more of: HD for recognizing C; NG for recognizing T; NI for recognizing A; NN for recognizing G or A; NS for recognizing A or C or G or T; N* for recognizing C or T, where * represents a gap in the second position of the RVD; HG for recognizing T; H* for recognizing T, where * represents a gap in the second position of the RVD; IG for recognizing T; NK for recognizing G; HA for recognizing C; ND for recognizing C; HI for recognizing C; HN for recognizing G; NA for recognizing G; SN for recognizing G or A; and YG for recognizing T. Each DNA binding cassette can comprise an RVD that determines recognition of a base pair in the target DNA sequence, wherein each DNA binding cassette is responsible for recognizing one base pair in the target DNA sequence, and wherein the RVD comprises one or more of: HA for recognizing C; ND for recognizing C; HI for recognizing C; HN for recognizing G; NA for recognizing G; SN for recognizing G or A; YG for recognizing T; and NK for recognizing G, and one or more of: HD for recognizing C; NG for recognizing T; NI for recognizing A; NN for recognizing G or A; NS for recognizing A or C or G or T; N* for recognizing C or T, wherein * represents a gap in the second position of the RVD; HG for recognizing T; H* for recognizing T, wherein * represents a gap in the second position of the RVD; and IG for recognizing T.
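The RVD-to-base relationships listed above lend themselves to a simple lookup table. The following is a minimal sketch only: the names RVD_TO_BASES, PREFERRED, design_rvds and binds are hypothetical, and the choice of one RVD per base in PREFERRED is an arbitrary selection from the list above rather than a recommended design rule.

```python
# Base preferences exactly as listed in the paragraph above.
RVD_TO_BASES = {
    "HD": "C", "NG": "T", "NI": "A", "NN": "GA", "NS": "ACGT",
    "N*": "CT", "HG": "T", "H*": "T", "IG": "T", "NK": "G",
    "HA": "C", "ND": "C", "HI": "C", "HN": "G", "NA": "G",
    "SN": "GA", "YG": "T",
}

# Assumption: one RVD chosen per base purely for illustration.
PREFERRED = {"A": "NI", "C": "HD", "G": "NK", "T": "NG"}


def design_rvds(target):
    """Return one RVD per base of the target, in 5' to 3' order."""
    return [PREFERRED[base] for base in target.upper()]


def binds(rvds, target):
    """True if every RVD accepts the base at the same position."""
    target = target.upper()
    return len(rvds) == len(target) and all(
        base in RVD_TO_BASES[rvd] for rvd, base in zip(rvds, target)
    )


rvds = design_rvds("GATC")
assert binds(rvds, "GATC")
```

Because several RVDs accept more than one base (for example NN or NS), binds can return True for more than one target sequence for a given RVD array.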

In certain instances it may be required to target a methylated nucleic acid sequence or a methylated chromatin region of a cell. DNA is usually methylated by DNA methyltransferase at the C5 position of cytosine, often in the context of a CpG dinucleotide motif, resulting in 5-methylcytosine (mC). It has been found that between 60% and 90% of all CpGs are methylated in mammalian and plant somatic or pluripotent cells whereas most unmethylated CpGs are grouped in clusters referred to as CpG islands which are present in the 5′ regulatory regions of many genes. DNA methylation is important for regulation of gene transcription and genes with high levels of mC in their promoter region are transcriptionally silent. It was found that methylated DNA can be specifically recognized by TAL effectors via RVDs NG and N* (Deng et al. Recognition of methylated DNA by TAL effectors. Cell Research 22: 1502-1504 (2012); Valton et al. Overcoming TALE DNA Binding Domain Sensitivity to Cytosine Methylation. J. Biol. Chem. 287: 38427-38432 (2012)). NG usually recognizes T whereas N* binds both T and C. It was shown that both RVDs additionally bind mC. Thus, whereas N* may be included in TAL effectors in those instances where both cytosine variants (mC and C) are to be recognized, NG (which recognizes only mC but not C) may be used to distinguish methylated from un-methylated sequences. Thus, the invention further relates, in part, to TAL effectors wherein each TAL nucleic acid binding cassette is responsible for recognizing one base pair in the target DNA sequence, and wherein the RVD comprises one or more of: NG or N* for recognizing mC.

The invention thus includes TAL effectors that recognize methylated nucleic acids, as well as methods for bringing TAL effector fusion proteins with various biological activities in contact with such nucleic acids. Exemplary activities include methylation and demethylation activities. The invention thus includes compositions and methods for altering the methylation state of nucleic acids. In one aspect, the invention includes methods for altering the methylation state of nucleic acid molecules in cells comprising contacting the cells with one or more nucleic acid molecules encoding a non-naturally occurring fusion protein comprising an artificial transcription activator-like (TAL) effector repeat domain, for example, of contiguous repeat units 33 to 35 amino acids in length, and a methylation state modification activity (e.g., methylation or demethylation), wherein the repeat domain is engineered for recognition of a predetermined nucleotide sequence, wherein the fusion protein recognizes the predetermined nucleotide sequence, and wherein the fusion protein is expressed in the cells.

The invention thus includes methods for altering the methylation state of specific regions of nucleic acid molecules, for example, in cells. This includes the conversion of hemimethylated nucleic acids to fully methylated or fully demethylated nucleic acids. Thus, the invention includes methods for converting hemimethylated nucleic acids to fully methylated or fully demethylated nucleic acids, as well as compositions of matter for performing such methods.

Exemplary methylases that may be used in the practice of the invention are described elsewhere herein.

In some aspects, TAL cassettes may be assembled from single cassettes or monomers and a library of monomers may be provided representing at least four different categories wherein at least one category encodes an RVD to bind A, at least one category encodes an RVD to bind G, at least one category encodes an RVD to bind C and at least one category encodes an RVD to bind T, wherein the RVDs binding A, G, C or T may be chosen from the aforementioned list.

The target site bound by a TAL effector or TAL effector fusion can meet at least one of the following criteria: (i) is a minimum of 15 bases long and is oriented from 5′ to 3′ with a T immediately preceding the site at the 5′ end; (ii) does not have a T in the first (5′) position or an A in the second position; (iii) ends in T at the last (3′) position and does not have a G at the next to last position; and (iv) has a base composition of 0-63% A, 11-63% C, 0-25% G, and 2-42% T.
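The four criteria above can be evaluated mechanically for a candidate site. The sketch below is illustrative only: the function name check_target_site, the dictionary keys and the example sequences are hypothetical, and the site is assumed to be given 5′ to 3′ together with the single base immediately preceding it.

```python
def check_target_site(site, upstream_base):
    """Report which of criteria (i)-(iv) a candidate site satisfies."""
    site = site.upper()
    n = len(site)
    if n < 2:
        raise ValueError("site must contain at least two bases")
    # Percentage of each base over the whole site.
    percent = {base: 100.0 * site.count(base) / n for base in "ACGT"}
    return {
        # (i) at least 15 bases with a T immediately 5' of the site
        "i": n >= 15 and upstream_base.upper() == "T",
        # (ii) no T in the first position and no A in the second
        "ii": site[0] != "T" and site[1] != "A",
        # (iii) ends in T and has no G at the next to last position
        "iii": site[-1] == "T" and site[-2] != "G",
        # (iv) base composition within the stated ranges
        "iv": (
            percent["A"] <= 63
            and 11 <= percent["C"] <= 63
            and percent["G"] <= 25
            and 2 <= percent["T"] <= 42
        ),
    }


result = check_target_site("CCACTGCACCATCAT", upstream_base="T")
print(result)
```

Since the text requires only that at least one criterion be met, a caller would typically test any(result.values()); a stricter screen could require all(result.values()).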

In another aspect, an engineered TAL effector may be designed to incorporate a nucleic acid encoding a variant 0th DNA binding cassette with specificity for A, C, or G, thus eliminating the requirement for T at position −1 of the target site.

TAL Repeat Structures—Burkholderia-Derived

Burkholderia TAL-Like Amino Acid Sequences: Hypothetical protein RBRH_01844 of Burkholderia rhizoxinica HKI 454 has the following amino acid sequence in which standard one-letter amino acid abbreviations are used (GenBank Accession No. YP_004022479.1) (SEQ ID NO:48).

  1 mstafvdqdk qmanrlnlsp lerskiekqy ggattlafis nkqnelaqil sradilkias 61 ydcaahalqa vldcgpmlgk rgfsqsdivk iagniggaqa lqavldlesm lgkrgfsrdd121 iakmagnigg aqtlqavldl esafrergfs qadivkiagn nggaqalysv ldveptlgkr181 gfsradivki agntggaqal htvldlepal gkrgfsridi vkiaanngga qalhavldlg241 ptlrecgfsq atiakiagni ggaqalqmvl dlgpalgkrg fsqatiakia gniggaqalq301 tvldlepalc ergfsqatia kmagnnggaq alqtvldlep alrkrdfrqa diikiagndg361 gaqalqavie hgptlrqhgf nladivkmag niggaqalqa vldlkpvlde hgfsqpdivk421 magniggaqa lqavlslgpa lrergfsqpd ivkiagntgg aqalqavldl eltlvehgfs481 qpdivritgn rggaqalqav laleltlrer gfsqpdivki agnsggaqal qavldleltf541 rergfsqadi vkiagndggt qalhavldle rmlgergfsr adivnvagnn ggaqalkavl601 eheatlnerg fsradivkia gngggaqalk avleheatld ergfsradiv riagngggaq661 alkavlehgp tlnergfnlt divemaansg gaqalkavle hgptlrqrgl slidiveias721 nggaqalkav lkygpvlmqa grsneeivhv aarrggagri rkmvapller q

Hypothetical protein RBRH_01776, also of Burkholderia rhizoxinica HKI 454, has the following amino acid sequence (GenBank Accession No. YP_004030669) (SEQ ID NO:49).

  1 mpatsmhqed kqsanglnls plerikiekh ygggatlafi snqhdelaqv lsradilkia 61 sydcaaqalq avldcgpmlg krgfsradiv riagngggaq alysvldvep tlgkrgfsqv121 dvvkiaggga qalhtvleig ptlgergfsr gdivtiagnn ggaqalqavl eleptlrerg181 fnqadivkia gngggaqalq avldvepalg krgfsrvdia kiagggaqal gavlgleptl241 rkrgfhptdi ikiagnngga qalqavldle lmlrergfsq adivkmasni ggaqalqavl301 nlepalcerg fsqpdivkma gnsggaqalq avldlelafr ergfsqadiv kmasniggaq361 alqavlelep alhergfsqa nivkmagnsg gaqalqavld lelvfrergf sqpeivemag421 niggaqalht vldlelafre rgvrqadivk ivgnnggaqa lqavfelept lrergfnqat481 ivkiaanggg aqalysvldv eptldkrgfs rvdivkiagg gaqalhtafe leptlrkrgf541 nptdivkiag nkggaqalqa vlelepalre rgfnqativk magnaggaqa lysvldvepa601 lrergfsqpe ivkiagnigg aqalhtvlel eptlhkrgfn ptdivkiagn sggaqalqav661 lelepafrer gfgqpdivkm asniggaqal qavlelepal rergfsqpdi vemagnigga721 qalqavlele pafrergfsq sdivkiagni ggaqalqavl eleptlresd frqadivnia781 gndgstqalk aviehgprlr qrgfnrasiv kiagnsggaq alqavlkhgp tldergfnlt841 nivkiagngg gaqalkavie hgptlqqrgf nltdivemag kgggaqalka vlehgptlrq901 rgfnlidive masntggaqa lktvlehgpt lrqrdlslid iveiasngga qalkavlkyg961 pvlmqagrsn eeivhvaarr ggagrirkmv alllerq

FIGS. 25A and 25B show an amino acid sequence alignment between the amino acid sequences of the two proteins represented. Also included in FIGS. 25A and 25B is a consensus sequence of identical or strongly similar positions thereof. The proteins have related and short N and C termini, indicating that the sequences represent the complete sequences of the proteins. As shown in FIG. 26 for the RBRH_01776 protein, a TAL repeat region begins at amino acid 51 and ends at amino acid 958 and is composed of individual repeats of 33 amino acids. Most of the TAL repeat regions shown have a recognizable repeat variable diresidue sequence (boxed) beginning with an “N.” A partial TAL repeat precedes the indicated carboxyl flanking region.

As further discussed below, individual repeated sequences of Burkholderia proteins tend to contain 33 amino acids and contain more homology to each other than to known TAL repeat sequences. The conservation of repeat length, and of several amino acid residue positions (including nucleotide binding RVDs at positions 12 and 13) with known TAL repeat sequences suggests that these proteins are expressed and functional and do not represent pseudogenes. The proteins are believed to have nucleic acid binding activity, in part, due to their similarity to known TALEs and TALE repeats.

Based upon the Burkholderia sequences, TAL repeats were characterized as set out in FIG. 27. Burkholderia repeat sequences nos. 1-18 are from the RBRH_01776 protein and repeat sequences A-T are from the RBRH_01844 protein. The white letters on a black background show identical and strongly similar amino acids as noted in TABLE 1 and the text describing this table.

The double arrow symbol in FIG. 27, as well as other figures herein, represents the two amino acids that have recognition properties for particular deoxyribonucleotides, i.e., the RVD or repeat variable diresidue sequence as described earlier herein. Those amino acid diresidues at positions 12 and 13 for the Burkholderia repeat sequences above are as follows: NA, ND, NG, NI, NK, NN, NR, NS, NT, and N-. Based on correlations between such repeat variable diresidues and their cognate deoxyribonucleotides, the Burkholderia RVDs appear to have specificity for binding as follows: NA for recognizing guanine, ND for recognizing cytidine, NG for recognizing thymine, NI for recognizing adenine, NK for recognizing guanine, NN for recognizing guanine or adenine, NR may lack specificity, NS for recognizing any deoxyribonucleotide, NT for recognizing any deoxyribonucleotide with a strong preference for adenine and guanine, and N- for recognizing cytidine or thymine, where - represents a gap in the second position of the RVD.

The Burkholderia repeat sequences contain several conserved regions. In one aspect, the repeat sequence comprises the sequence GG(A/T)Q(A/T)LX₁X₂V(L/F/I) (SEQ ID NO: 95) immediately after the repeat variable diresidue at positions 12 and 13, i.e., at positions 14-23, where “X₁” and “X₂” are other than E or G and may be the same or different. The parenthesis (A/T) means that either amino acid A or T may be in the indicated position. Similarly, the parenthesis (L/F/I) means that either amino acid L or F or I may be in the indicated position. In another aspect, X₁ is Q, H, Y or K; and X₂ is A, T, S, or M. In another aspect, an amino acid sequence at positions 14-23 of a Burkholderia repeat sequence is GGAQALX₁X₂VL (SEQ ID NO: 96) where “X₁” and “X₂” are other than E or G and may be the same or different, or X₁ is Q, H, Y or K; and X₂ is A, T, S, or M. In another aspect, an amino acid sequence at positions 14-23 of a Burkholderia repeat sequence is GGAQALQAVL (SEQ ID NO: 97), or a sequence having 70%, 80% or 90% identity thereto. Positions are in reference to the repeat variable diresidue at positions 12 and 13, identified in FIG. 27 by the double arrow symbol.

In another aspect, the repeat sequence for Burkholderia comprises the sequence GGAQAL (SEQ ID NO: 98) at positions 14-19, or a sequence having 80% identity thereto. In another aspect, the repeat sequence for Burkholderia comprises I at position 6. Position 6 distinguishes the above cited Burkholderia repeat sequences from those of Ralstonia and Xanthomonas repeat sequences in that this position is V or L in the Ralstonia and Xanthomonas sequences.
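The conserved features described above (the GG(A/T)Q(A/T)LX₁X₂V(L/F/I) motif at positions 14-23 and isoleucine at position 6) can be expressed as a simple pattern test. This is a sketch under stated assumptions: the function names are hypothetical, the Burkholderia example instantiates the consensus of SEQ ID NO: 50 with X₃ = I and X₄ = A for illustration, and the character class [^EG] also admits non-amino-acid characters, so inputs are assumed to be valid one-letter sequences.

```python
import re

# GG(A/T)Q(A/T)L X1 X2 V (L/F/I), with X1 and X2 other than E or G.
MOTIF_14_23 = re.compile(r"GG[AT]Q[AT]L[^EG][^EG]V[LFI]")


def has_motif_14_23(repeat):
    """True if 1-based positions 14-23 match the conserved motif."""
    return MOTIF_14_23.fullmatch(repeat[13:23]) is not None


def has_ile_at_6(repeat):
    """True if 1-based position 6 is isoleucine."""
    return len(repeat) >= 6 and repeat[5] == "I"


# Consensus-derived example (X3 = I, X4 = A chosen for illustration).
burkholderia_like = "FSQADIVKIAGNIGGAQALQAVLDLEPALRERG"
# Naturally occurring cassette of SEQ ID NO: 90 for comparison.
typical_cassette = "LTPEQVVAIASHDGGKQALETVQRLLPVLCQAHG"

for name, seq in [("consensus", burkholderia_like), ("SEQ ID NO: 90", typical_cassette)]:
    print(name, has_motif_14_23(seq), has_ile_at_6(seq))
```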

Further, in some aspects of a Burkholderia repeat sequence, position 5 is other than Q, position 6 is other than V or L, position 8 is other than A or V, or position 26 is other than L. Positions are in reference to the repeat variable diresidue at positions 12 and 13, identified in FIG. 27 by the double arrow symbol.

In some aspects of a Burkholderia repeat sequence, position 1 is F, V, or L, position 2 is S or N, position 3 is Q or R, position 4 is A, P, or T, position 5 is D or T, position 6 is an I, position 7 is V or A, position 8 is K or R, position 9 is I or M, position 10 is A, position 11 is G, position 24 is D or E, position 25 is L, V, or H, position 26 is E or G, position 27 is P or L, position 28 is A or T, position 29 is L or F, position 30 is R or G, position 31 is E or K, position 32 is R, or position 33 is G. Positions are in reference to the repeat variable diresidue at positions 12 and 13, identified in FIG. 27 by the double arrow symbol.

In some aspects of a Burkholderia repeat sequence, position 1 is F, V, or L, position 2 is S, N, H, R or G, position 3 is Q, R, P or L, position 4 is A, P, T, G, S, I or D, position 5 is D, T, N or E, position 6 is an I, position 7 is V, A or I, position 8 is K, R, T, E or N, position 9 is I, M or V, position 10 is A, V, or T, position 11 is G, A or S, position 24 is D, E, N, S, or A, position 25 is L, V, or H, position 26 is E, G, or K, position 27 is P, L, S, R, or A, position 28 is A, T, V, or M, position 29 is L or F, position 30 is R, G, C, D, H, V, or N, position 31 is E, K, or Q, position 32 is R, S, C, or H, or position 33 is G or D.

In one aspect of repeat sequences, the repeat has a consensus protein sequence FSQADIVKIAGNX₃GGAQALQAVLDLEPX₄LRERG (SEQ ID NO: 50) where “X₃” represents a DNA base recognition residue such as I, N, T, D, R, S, G, K or A, and where “X₄” represents A or T, or a sequence having 60%, 70%, 80% or 90% identity thereto.

Amine and Carboxyl Regions Flanking Burkholderia TAL repeats: The amine and carboxyl termini of Burkholderia proteins are naturally shorter than even truncated TALEs described herein for Xanthomonas or Ralstonia species.

The amine terminal region of the RBRH_01844 TAL effector has two candidate repeat structures roughly at amino acids 18-50 and 51-82 (based on partial sequence homology to the repeated sequences), thereby providing for a number of possible combinations for the amine-terminal sequence flanking the repeated sequences of an engineered TAL effector. For example, all 82 amino acids may be present (i.e., no truncations), amino acids 1-17 may be present, and/or amino acids 51-82 may be present in the amine flanking region, or any combination thereof. Further, truncations from either end of the amine flanking sequence can generate altered amine flanking regions for use in engineered constructs. Restriction sites may be introduced as needed into any location of a nucleic acid encoding the amine flanking region to facilitate cloning procedures. Further, a restriction site can be engineered into this region such that it will be relatively straightforward to make any desirable modifications to the protein structure. For example, compatible restriction sites can be included such that the genes can be cloned into the existing VP16/64 activator and FokI nuclease vectors, as described elsewhere herein. Further, amine flanking sequences or truncated amine flanking sequences used for Xanthomonas-type TAL repeat constructs may be engineered to flank Burkholderia TAL repeats in an engineered construct.

In an aspect, the amine terminal region flanking the repeat regions of both proteins represented in FIGS. 25A and 25B contains the conserved amino acid sequences LNLSPLER (SEQ ID NO: 51) and TLAFISN (SEQ ID NO: 52). The invention includes proteins comprising one or both of these amino acid sequences, as well as proteins containing amino acid sequences at least 80%, 85%, or 90% identical or strongly similar to one or both of these sequences.

As stated herein, nucleic acid target sites of TAL effectors tend to include a thymine base flanking the 5′ base targeted by the first repeat of the effector; this appears to be due to a contact between the thymine and a conserved tryptophan residue in the amine flanking region N-terminal to the repeated sequences. In contrast to this pattern, which was essentially based on TALEs from Xanthomonas, there is no tryptophan (W) residue in the N terminal (or any) region of the Burkholderia proteins, which suggests that a 5′ thymine in the DNA binding site is not required.

The sequence from amino acid 710 to 741 of the RBRH_01844 amino acid sequence shown in FIG. 2A is shown in FIG. 27 as line T, and may or may not be a functional repeat in the sense that it binds a base in a nucleic acid molecule. In any event, the sequence of this protein demonstrates a relatively short C-terminal region of 29 or 61 residues. Further, truncations from either end of the carboxyl flanking sequence can generate altered carboxyl flanking regions for use in engineered constructs. Further, restriction sites may be introduced as needed into a nucleic acid encoding the carboxyl flanking region to facilitate cloning procedures and/or to make any desirable modifications to the protein structure carboxyl to the repeated sequence. For example, compatible restriction sites can be included such that the genes can be cloned into the existing VP16/64 activator and FokI nuclease vectors. Further, carboxyl flanking sequences or truncated carboxyl flanking sequences used for Xanthomonas-type TAL repeat constructs may be engineered to flank Burkholderia TAL repeats in an engineered construct.

In an aspect, the carboxyl terminal region flanking the repeat regions of both proteins represented in FIGS. 25A and 25B contains the conserved amino acid sequences YGPVLMQAGRSNEEIVHVAARRGGAGRIRKMVA (SEQ ID NO: 53) and LLERQ (SEQ ID NO: 54). The invention includes proteins comprising one or both of these amino acid sequences, as well as proteins containing amino acid sequences at least 80%, 85%, or 90% identical or strongly similar to one or both of these sequences.

The short flanking regions of Burkholderia TAL effectors can obviate the need to use TALE amine or carboxyl truncations as described earlier. Further, the particularly compact structure of Burkholderia TAL effectors contributes to shorter vector molecules, smaller plasmids that are more efficiently introduced into cells, and to smaller proteins that are generally more highly expressed.

TAL Repeat Structures—Marine Organism-Based

Further TAL repeat structures are found in marine organisms designated herein as “Marine Organism A” and “Marine Organism B.” The organisms from which these TAL repeat sequences were derived have not been identified, and sequence alignment based searches of the available amino acid sequence data provided no additional information related to the identification of these organisms.

FIG. 28 shows an alignment and a consensus sequence for repeats from Marine Organism A. The repeat variable diresidues at positions 12 and 13 of Marine Organism A are sequences known to recognize particular bases in DNA, i.e., the diresidue HG recognizes T, HD recognizes C, and NN recognizes G or A.

A conserved six amino acid sequence of GGSKNL (SEQ ID NO: 83), at positions 14-19, immediately follows the repeat variable diresidue sequence. Another conserved sequence is IVQMVS (SEQ ID NO: 99), at positions 6-11. The isoleucine at position 6 is invariant among Marine Organism A1-A9 repeats; position 6 is also isoleucine and invariant in Burkholderia repeats, but that position has not been found to be isoleucine in Xanthomonas or Ralstonia species thus far. The invention includes proteins which contain the amino acid features referred to above (e.g., the sequences: GGSKNL (SEQ ID NO: 83) and/or IVQMVS (SEQ ID NO: 99)).

FIG. 29 provides an alignment and a consensus sequence for repeats from Marine Organism B. The repeat variable diresidues at positions 12 and 13 of Marine Organism B are sequences known to recognize particular bases in DNA, i.e., the diresidue HG recognizes T, HD recognizes C, HI recognizes C, and NN recognizes G or A.

The six amino acid sequence immediately following the repeat variable diresidue sequence at positions 14-19 has a sequence GA(T/N)(Q/K)(A/T)I (SEQ ID NO: 100). This sequence differs from that of TABLE 4 (GGK(P/Q)AL) (SEQ ID NO: 101), from that of Burkholderia repeats (GG(A/T)Q(A/T)L) (SEQ ID NO: 102), and from that of Marine Organism A (GGSKNL) (SEQ ID NO: 83). Another conserved sequence is PKDIVSIAS (SEQ ID NO: 103), at positions 3-11. The isoleucine at position 6 is again invariant, similar to that of Marine Organism A and that of Burkholderia; position 6 has not been found to be isoleucine in Xanthomonas or Ralstonia species thus far. The invention includes proteins which contain the amino acid features referred to above (e.g., the sequences: GA(T/N)(Q/K)(A/T)I (SEQ ID NO: 100) and/or PKDIVSIAS (SEQ ID NO: 103)).

TAL Repeat Structures—Blood Borne Pathogen-Based

Further TAL repeat structures are found in a blood-borne pathogen designated herein as “BBP.” Based upon amino acid sequence alignment of the protein which contains the TAL repeats, this organism is likely a strain of Ralstonia solanacearum.

FIG. 30 provides an alignment for six repeats from BBP, further compared with a number of repeats of proteins from Xanthomonas, Ralstonia and Marine Organism B. A consensus sequence is provided for repeat structures from these four organisms.

The repeat variable diresidues at positions 12 and 13 of the blood-borne pathogen are sequences that recognize particular bases in nucleic acids (e.g., the diresidue NG recognizes thymine, NN recognizes guanine or adenine, NT recognizes any deoxyribonucleotide with preference for adenine and guanine, and SI is thought to recognize adenine or cytosine).
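For purposes of illustration only, the repeat variable diresidue assignments stated above for Marine Organism A, Marine Organism B and the blood-borne pathogen can be collected into a lookup table and used to check whether an ordered set of repeats is compatible with a given DNA sequence. The following is a minimal sketch; the names RVD_TO_BASES, allowed_bases and site_matches are hypothetical, the stated preference of NT for adenine and guanine is not modeled, and the example repeat order is invented.

```python
# Hypothetical lookup built only from the RVD-to-base statements in the text.
RVD_TO_BASES = {
    "HG": "T",
    "HD": "C",
    "HI": "C",
    "NG": "T",
    "NN": "GA",    # guanine or adenine
    "NT": "ACGT",  # any base; the stated preference for A and G is not modeled
    "SI": "AC",    # "thought to recognize" adenine or cytosine
}


def allowed_bases(rvds):
    """Return, for each repeat in order, the set of bases its RVD may recognize."""
    return [set(RVD_TO_BASES[rvd]) for rvd in rvds]


def site_matches(rvds, dna):
    """True if every base of dna is allowed by the RVD at the same position."""
    allowed = allowed_bases(rvds)
    if len(allowed) != len(dna):
        return False
    return all(base in bases for base, bases in zip(dna.upper(), allowed))


# Invented example: five repeats read against two candidate five-base sites.
example_rvds = ["NG", "HD", "NN", "SI", "NT"]
print(site_matches(example_rvds, "TCGAT"))
print(site_matches(example_rvds, "TCTAT"))
```

The first site is compatible with every repeat, whereas the second is rejected because NN is not stated to recognize thymine.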

The six amino acid sequence immediately following the repeat variable diresidue sequence for blood-borne pathogen repeats, at positions 14-19, has a sequence GG(K/R)QAL (SEQ ID NO: 104). Further, position 6 is a valine. Another fairly conserved sequence is QVV(A/V)IA(S/N) (SEQ ID NO: 105), at positions 5-11. The invention includes proteins which contain the amino acid features referred to above (e.g., the sequences: GG(K/R)QAL (SEQ ID NO: 104) and/or QVV(A/V)IA(S/N) (SEQ ID NO: 105)).

Using TAL repeat sequences in engineered constructs: A TAL effector fusion construct may be designed as described herein to contain Burkholderia flanking and/or repeated sequences or to contain marine organism repeated sequences. That is, in one aspect, at least one of the amine flanking region, the repeated sequence, and the carboxyl flanking region of a construct may be substantially based on a Burkholderia sequence as provided herein while remaining sequences of a construct may be substantially based on Xanthomonas or Ralstonia sequences.

Further, when the amine or carboxyl flanking regions of an engineered TAL protein are prepared, these flanking regions may contain portions of one or both flanking regions set out in FIG. 25A and/or FIG. 25B. In some embodiments, one or both flanking regions of the engineered protein may contain an amino acid region comprising from about 10 to about 30 amino acids, about 10 to about 40 amino acids, about 15 to about 40 amino acids, about 15 to about 30 amino acids, about 15 to about 20 amino acids, about 10 to about 20 amino acids, etc., identical or strongly similar to an amino acid sequence shown in FIG. 25A and/or FIG. 25B.

In another aspect, a repeated sequence may be substantially based on a marine organism repeated sequence as provided herein while carboxyl or amine flanking sequences may be substantially based on a Xanthomonas, a Ralstonia, or a Burkholderia sequence, for example.

Summary of TAL Protein Homologs: TABLE 5 shows positional amino acid sequence variations derived from fifty-one naturally occurring TAL repeats and repeats from proteins believed to be TAL protein homologs. The numbering in TABLE 5 corresponds to individual positions in TAL repeats in which positions 12 and 13 designate the repeat variable diresidue which recognizes particular deoxyribonucleotides. The numbers next to the amino acid designations indicate the number of TAL repeats that contain that particular amino acid in that location. For example, at position 1, phenylalanine was found 26 times, leucine was found 24 times, and valine was found 1 time.

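TABLE 5 TAL Repeat Variations

Repeat Position 1: F 26, L 24, V 1
Repeat Position 2: S 17, T 11, R 9, E 5, N 5, G 1, H 1, L 1, Q 1
Repeat Position 3: P 22, T 13, Q 13, R 2, L 1
Repeat Position 4: E 16, A 10, K 6, P 5, D 4, T 3, Q 3, G 2, N 1, S 1
Repeat Position 5: D 26, Q 18, E 2, G 2, T 2, N 1
Repeat Position 6: I 33, V 17, L 1
Repeat Position 7: V 50, I 1
Repeat Position 8: A 17, K 14, Q 9, S 6, E 2, R 1, T 1, V 1
Repeat Position 9: I 34, M 17
Repeat Position 10: A 41, V 10
Repeat Position 11: S 32, G 14, A 3, N 2
Repeat Position 12: (repeat variable diresidue; no entries)
Repeat Position 13: (repeat variable diresidue; no entries)
Repeat Position 14: G 51
Repeat Position 15: G 44, A 6, S 1
Repeat Position 16: A 18, K 17, S 9, T 4, N 2, R 1
Repeat Position 17: Q 40, K 10, P 1
Repeat Position 18: A 41, N 9, T 1
Repeat Position 19: L 45, I 6
Repeat Position 20: E 18, Q 13, A 5, T 5, V 3, Y 3, G 2, H 2
Repeat Position 21: A 29, T 17, S 3, V 2
Repeat Position 22: V 39, L 6, I 5, M 1
Repeat Position 23: L 23, Q 14, K 5, I 4, R 2, T 2, F 1
Repeat Position 24: D 12, E 12, A 10, R 9, N 4, T 3, V 1
Repeat Position 25: L 24, K 9, Q 8, N 4, V 4, R 1, S 1
Repeat Position 26: E 18, L 18, W 6, Y 6, S 3
Repeat Position 27: P 27, L 9, A 5, D 4, T 4, G 2
Repeat Position 28: A 22, V 13, T 7, E 3, D 2, I 2, M 1, S 1
Repeat Position 29: L 46, F 5
Repeat Position 30: R 25, C 10, T 6, K 3, G 2, H 2, I 2, D 1
Repeat Position 31: E 15, A 11, Q 8, G 7, K 4, D 3, R 1, T 1, V 1
Repeat Position 32: R 17, A 15, L 10, K 5, V 2, D 1, S 1
Repeat Position 33: G 28, H 9, P 9, E 3, D 1, K 1
Repeat Position 34: − 33, G 9, Y 8, L 1
Repeat Position 35: − 42, E 7, N 1, G 1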

When assessing the data of TABLE 5, several factors should be considered, including the following:

1. Not all TAL repeats within a TAL protein may be functional, and some may exhibit otherwise unfavorable DNA binding activity (e.g., binding affinity which is too high or too low for optimal DNA functional interaction). Since TAL proteins tend to recognize multiple bases, a TAL protein may interface with DNA correctly even where one or more repeats are non-functional.

2. A single amino acid alteration may result in a TAL repeat becoming non-functional, but this non-functionality may be corrected by one or more amino acid alterations at another location within or external to the TAL repeat.

3. The data presented in TABLE 5 are derived from a subset of known and predicted TAL proteins.

4. A TAL repeat modification that results in enhanced DNA binding activity may confer a selective disadvantage to a host cell because it could result in functional activity (e.g., transcriptional activation) which is either “leaky” or difficult to “off-regulate”.

Some of the amino acids in particular positions of TABLE 5 are well conserved and others are much less conserved. As examples, amino acid positions 1, 6, 7, 9, 10, 14, 15, 17, 18, 19, and 29 are well conserved. Using amino acid position 1 for purposes of illustration, three amino acids have been found: phenylalanine, leucine, and valine. Further, valine is seen only once. These data suggest that having a valine at position 1 of a TAL repeat is not optimal. Further, amino acid positions exhibiting low conservation include positions 2, 4, 8, 16, 20, 23, 24, 25, 26, 27, 28, 30, 31, 32, 33, 34, and 35, with amino acid positions 34 and 35 optionally being deleted.
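As one possible way of putting numbers on the degree of conservation discussed above, the counts at a position can be summarized by the number of distinct amino acids observed and by the Shannon entropy of their distribution (0 bits for an invariant position, higher values for more variable positions). The sketch below is illustrative only: the choice of entropy as the measure is an assumption rather than a method stated herein, the names are hypothetical, and the counts for positions 1, 4, 7 and 14 are transcribed from TABLE 5 as read from the flattened table.

```python
import math

# Counts for a few positions, transcribed from TABLE 5 (51 repeats each).
TABLE5_COUNTS = {
    1: {"F": 26, "L": 24, "V": 1},
    4: {"E": 16, "A": 10, "K": 6, "P": 5, "D": 4,
        "T": 3, "Q": 3, "G": 2, "N": 1, "S": 1},
    7: {"V": 50, "I": 1},
    14: {"G": 51},
}


def shannon_entropy(counts):
    """Entropy in bits of the amino acid distribution at one repeat position."""
    total = sum(counts.values())
    return sum((n / total) * math.log2(total / n) for n in counts.values() if n)


# Print position, number of distinct amino acids, and entropy in bits.
for position, counts in sorted(TABLE5_COUNTS.items()):
    print(position, len(counts), round(shannon_entropy(counts), 2))
```

On these counts, the positions described above as well conserved (1, 7 and 14) show three or fewer distinct amino acids and lower entropy than position 4, which is listed among the positions of low conservation.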

In general, a degree of amino acid conservation is seen in the region of TAL repeats on the amine terminal side of positions 12 and 13. Further, amino acid alterations within TAL repeats are expected to alter TAL protein DNA binding activity. TABLE 6 shows amino acids found at individual repeat locations that were identified on the amino-terminal side of the RVD where an amino acid appeared more than once.

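TABLE 6 TAL Repeat N-Terminal Amino Acid Variations

Position 1: F, L
Position 2: S, T, R, E, N
Position 3: P, T, Q, R
Position 4: E, A, K, P, D, T, Q, G
Position 5: D, Q, E, G, T
Position 6: I, V
Position 7: V
Position 8: A, K, Q, S, E
Position 9: I, M
Position 10: A, V
Position 11: S, G, A, N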

Thus, an aspect of the invention includes TAL repeats, as well as TAL proteins that contain such TAL repeats, that contain amino acids shown in TABLE 6 at the indicated locations. For purposes of illustration, the invention includes TAL repeats that contain one or more of the following amino acid sequences: FSPEDIVAIAS (SEQ ID NO: 55), FTPEDIVAIAS (SEQ ID NO: 56), LTPADIVAIAS (SEQ ID NO: 57), LSPAQVVAIAS (SEQ ID NO: 58), and LTPAQIVKIAS (SEQ ID NO: 59).

An aspect of the invention may further include TAL repeats that contain phenylalanine or leucine at position 1, isoleucine or valine at position 6, valine at position 7, isoleucine or methionine at position 9, and/or alanine or valine at position 10.

A degree of amino acid conservation is also found immediately flanking the repeat variable diresidue sequences at positions 12 and 13 of the repeats, that is, from positions 6-11 and 14-19. For example, TABLE 7 shows amino acids found at individual repeat locations that were identified on the amino-terminal side of the RVD at positions 6-11 and on the carboxyl-terminal side of the RVD at positions 14-19 where an amino acid appeared more than once.

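TABLE 7 TAL Repeat Amino Acid Variations Immediately Flanking the Repeat Variable Diresidue Sequences (i.e., positions 12 and 13 designated /// below)

Position 6: I, V
Position 7: V
Position 8: A, K, Q, S, E
Position 9: I, M
Position 10: A, V
Position 11: S, G, A, N
Position 12: ///
Position 13: ///
Position 14: G
Position 15: G, A
Position 16: A, K, S, T, N
Position 17: Q, K
Position 18: A, N
Position 19: L, I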

Thus, an aspect of the invention includes TAL repeats, as well as TAL proteins that contain such TAL repeats, that contain amino acids shown in TABLE 7 at the indicated locations. For purposes of illustration, the invention includes TAL repeats that contain one or more of the following amino acid sequences: (I/V)V(A/K)(I/M)(A/V)(S/G) (SEQ ID NO: 60) at positions 6-11, and G(G/A)(A/K/S)(Q/K)(A/N)(L/I) (SEQ ID NO: 61) at positions 14-19, where (_/_) indicates that either amino acid could occur at that position.
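By way of a hedged illustration, the (_/_) notation used for SEQ ID NO: 60 and SEQ ID NO: 61 can be translated mechanically into regular-expression character classes and applied to positions 6-11 and 14-19 of a candidate repeat. The function names below are hypothetical, and the test repeat is the most-common-residue repeat described herein (SEQ ID NO: 62) with the diresidue HD inserted at positions 12 and 13 purely as a placeholder.

```python
import re


def pattern_to_regex(pattern):
    """Convert '(I/V)V(A/K)' style notation into a regular expression."""
    def to_class(match):
        return "[" + match.group(1).replace("/", "") + "]"
    return re.sub(r"\(([A-Z](?:/[A-Z])*)\)", to_class, pattern)


N_SIDE = pattern_to_regex("(I/V)V(A/K)(I/M)(A/V)(S/G)")     # positions 6-11
C_SIDE = pattern_to_regex("G(G/A)(A/K/S)(Q/K)(A/N)(L/I)")   # positions 14-19


def flanks_match(repeat):
    """Check positions 6-11 and 14-19 (1-based) of a repeat against both patterns."""
    return (re.fullmatch(N_SIDE, repeat[5:11]) is not None
            and re.fullmatch(C_SIDE, repeat[13:19]) is not None)


# Placeholder RVD "HD" at positions 12 and 13; the choice of HD is arbitrary.
candidate = "FSPEDIVAIAS" + "HD" + "GGAQALEAVLDLEPALRERG"
print(N_SIDE, C_SIDE, flanks_match(candidate))
```

A repeat with a residue outside the listed alternatives, for example leucine at position 7, would not satisfy the first pattern.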

An aspect of the invention includes TAL repeats that contain one or more of the following: glycine at position 14, glycine or alanine at position 15, glutamine or lysine at position 17, alanine or asparagine at position 18, leucine or isoleucine at position 19, alanine, threonine, serine, or valine at position 21, valine, leucine, or isoleucine at position 22, leucine, lysine, or phenylalanine at position 25, alanine, valine or threonine at position 28, and leucine or phenylalanine at position 29.

The following amino acid sequence represents a TAL repeat composed of the most commonly identified amino acids found at each position: FSPEDIVAIASX₅X₆GGAQALEAVLDLEPALRERG (SEQ ID NO: 62), where X₅ and X₆ are repeat variable diresidue sequences.

Additional variations containing common variations of amino acids are as follows: LSPEDIVKIAGX₅X₆GGKQALQAVLELEPVLCERG (SEQ ID NO: 63), where X₅ and X₆ are repeat variable diresidue sequences at positions 12 and 13, LTTEQIVAMASX₅X₆GGAKALEAVLDLEPALRERHG (SEQ ID NO: 64), where X₅ and X₆ are repeat variable diresidue sequences at positions 12 and 13, FSPEDIVAIASX₅X₆GGAQALEAVLDLEPALRERHGE (SEQ ID NO: 65), where X₅ and X₆ are repeat variable diresidue sequences at positions 12 and 13, or FTPEDIVKIAGX₅X₆GGKQALEAVLDLEPVLRERG (SEQ ID NO: 66), where X₅ and X₆ are repeat variable diresidue sequences at positions 12 and 13.

In some instances, particular TAL repeats may be non-functional in the sense that these repeats do not recognize one or more bases in a specific location in a nucleic acid molecule. As an example, consider the situation where a protein contains 15 TAL repeats and TAL repeats 1 through 8 and 10 through 15 recognize specific, ordered bases in a nucleic acid molecule. Further, assume that the binding sequence for this protein is as follows: ATCGT AGCTG TTGAT (SEQ ID NO: 67). In an instance where TAL repeat number 9 recognizes no base, as long as structural properties of the TAL repeat region are maintained, the protein would be expected to still bind the 15 nucleotide recognition sequence. Further, so long as the recognition sequence is long enough, unless duplication events have occurred, the sequence will occur rarely in the genome.
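The dependence of rarity on recognition sequence length can be illustrated with a rough estimate of the number of chance matches expected in a genome. The sketch below is a simplification and not a method stated herein: it assumes independent, equally frequent bases, counts both strands, and uses an assumed genome size of three billion base pairs; the function name is hypothetical.

```python
def expected_random_hits(genome_bp, specified_positions, both_strands=True):
    """Expected chance matches for a site with this many specified bases,
    assuming independent, equally frequent bases (a simplifying assumption)."""
    strands = 2 if both_strands else 1
    return strands * genome_bp / (4 ** specified_positions)


GENOME_BP = 3_000_000_000  # assumed genome size, for illustration only

# 15 specified bases (all repeats functional) versus 14 (repeat 9 recognizes no base).
for specified in (15, 14):
    print(specified, round(expected_random_hits(GENOME_BP, specified), 1))
```

Under these assumptions, each repeat that recognizes no base multiplies the expected number of chance matches by four, which is why a longer recognition sequence can tolerate one or more non-functional repeats.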

The invention thus includes proteins which contain TAL repeats where a portion of the TAL repeats are non-functional in that they do not recognize one or more bases in a specific location in a nucleic acid molecule, as well as nucleic acids which encode such proteins and methods for using such proteins and nucleic acids. In particular, the invention includes proteins which contain one or more (e.g., one, two, three, four, five, six, seven, eight, etc.) non-functional TAL repeats. Proteins of the invention may contain from about 1 to about 10, from about 2 to about 10, from about 3 to about 10, from about 4 to about 10, from about 1 to about 6, from about 1 to about 5, from about 1 to about 4, from about 2 to about 4, from about 2 to about 3, etc., non-functional TAL repeats.

The invention also includes proteins which contain TAL repeats which recognize more than one (e.g., two, three, four, five, six, seven, eight, etc.) nucleotide sequence. For purposes of illustration, a single TAL repeat region may be designed which recognizes each of the nucleotide sequences represented by the following: ATCGN ANCNG TTGAT (SEQ ID NO: 68), where N is any base. A single TAL repeat region may be designed where repeat numbers 5, 7, and 9 are not specific for any base but do not prevent a protein containing these non-functional repeats from binding the nucleotide sequence.
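To illustrate how a recognition sequence containing N positions corresponds to multiple nucleotide sequences, the following minimal sketch treats N as a wildcard and tests candidate sites against the pattern of SEQ ID NO: 68. The function names are hypothetical, only the forward strand is scanned, and the second candidate site and the scanned sequence are invented for the example.

```python
def binds(pattern, site):
    """True if site matches pattern, where 'N' in pattern accepts any base."""
    pattern = pattern.replace(" ", "").upper()
    site = site.replace(" ", "").upper()
    return len(pattern) == len(site) and all(
        p == "N" or p == s for p, s in zip(pattern, site)
    )


def count_sites(pattern, sequence):
    """Count windows of sequence (forward strand only) that match pattern."""
    width = len(pattern.replace(" ", ""))
    seq = sequence.replace(" ", "").upper()
    return sum(binds(pattern, seq[i:i + width]) for i in range(len(seq) - width + 1))


DEGENERATE = "ATCGN ANCNG TTGAT"  # SEQ ID NO: 68

print(binds(DEGENERATE, "ATCGT AGCTG TTGAT"))   # SEQ ID NO: 67
print(binds(DEGENERATE, "ATCGA AACCG TTGAT"))   # invented second site
print(count_sites(DEGENERATE, "GG" + "ATCGTAGCTGTTGAT" + "CC" + "ATCGAAACCGTTGAT"))
```

Both candidate sites differ only at positions 5, 7 and 9, so both satisfy the pattern, and the scan reports two matching locations in the invented sequence.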

Non-functionality of a TAL repeat may be conferred by the RVD sequence or by flanking amino acid sequences. For example, the RVD may not recognize a base. Also, one or both regions flanking the RVD may have a secondary structure which renders the TAL repeat non-functional.

The invention further includes methods for using one or more proteins which contain TAL repeats for interacting with multiple locations in a cellular genome. One application advantage of a TAL repeat containing protein with “loose” structure recognition is that proteins can be designed which bind to more than one location in a genome. In some instances, these locations may be engineered to contain a recognition sequence, the recognition sequences may be naturally present in the genome, or a combination of engineered and naturally occurring sequences may be used. In particular, the invention includes methods comprising (1) engineering a cell to contain a sequence at a specific genomic location which is recognized by a TAL repeat and (2) introducing into or expressing within the cell a protein containing the TAL repeat, wherein the TAL repeat recognizes both the sequence introduced into the cellular genome and a sequence which occurs naturally within the genome.

The invention includes proteins which contain the above sequences and variations thereof, for example, as indicated herein, as well as methods for designing, screening, and producing such proteins, further including nucleic acid molecules which encode such proteins.

TAL truncations. Naturally occurring TAL effectors from bacteria have been identified. In many instances, these proteins are believed to have functional activity in plant cells. It has been found that modification of naturally occurring TAL effectors can alter TAL effector fusion activities, especially activity in various types of cells. As an example, it has been shown that by using truncated flanking regions of naturally occurring TAL effectors, TAL effector fusion proteins can be generated with altered activities within mammalian cells (see, e.g., PCT Publication WO 2011/146121, the disclosure of which is incorporated herein by reference, and Miller et al., Nat. Biotechnol., 29:143-148 (2011)). Thus, the invention provides TAL effectors and TAL effector fusions with functional activities (e.g., sequence specific DNA binding activities, sequence specific nuclease activities, sequence specific transcription activation activities, etc.) in various cell types (e.g., plant cells, animal cells, mammalian cells, human cells, human liver cells, etc.).

One mechanism for altering functional activities of TAL effectors and TAL effector fusions is by alteration of the amino and carboxyl regions which flank the TAL repeats. In certain embodiments, either the amino flanking region or the carboxyl flanking region is altered. In additional embodiments, both the amino flanking region and the carboxyl flanking region are altered. In many instances, the TAL effectors and TAL effector fusions will be altered in a manner so as to provide higher functional activities in a particular cell type.

Using the Hax3 amino acid sequence shown in FIG. 3A as a point of reference, specific alterations to the amino flanking region of the TAL repeat include deletions up to amino acids 10, 25, 50, 65, 75, 90, 110, 115, 148, 152, 161, 175, 182, 195, 202, etc. up to the beginning of the TAL repeat region. In some embodiments, amino flanking regions may be from about 10 to about 400 amino acids, from about 50 to about 400 amino acids, from about 100 to about 400 amino acids, from about 150 to about 400 amino acids, from about 152 to about 400 amino acids, from about 200 to about 400 amino acids, from about 10 to about 300 amino acids, from about 10 to about 288 amino acids, from about 10 to about 200 amino acids, from about 10 to about 100 amino acids, from about 10 to about 75 amino acids, from about 10 to about 50 amino acids, from about 50 to about 300 amino acids, from about 50 to about 200 amino acids, from about 90 to about 300 amino acids, from about 90 to about 200 amino acids, from about 100 to about 300 amino acids, from about 100 to about 200 amino acids, etc., in length.

By “carboxyl flanking region” and “amino flanking region”, when used in the context of TAL repeats, is meant either naturally occurring flanking regions or derivatives thereof. A “derivative”, as used with respect to TAL flanking regions, refers to truncations of naturally occurring flanking regions and amino acid segments of at least 20 amino acids which share at least 90% amino acid sequence identity with a naturally occurring TAL flanking region from a species in either of the following genera: Xanthomonas or Ralstonia. Thus, carboxyl and amino flanking regions include truncations and will generally include at least 10 amino acids of a naturally occurring TAL effector (e.g., Hax3) flanking region. Heterologous amino acid segments not normally associated with natural TAL effectors (e.g., a V5 epitope) do not fall within the scope of carboxyl and amino flanking regions. Thus, as an example, amino acid segments 1, 2, 6 and 7 in FIG. 3B are not carboxyl and amino flanking regions as the terms are used with respect to a TAL repeat.

In some instances, TAL effectors and TAL effector fusions of the invention will contain one or more of the sequences set out in TABLE 8 and/or will not contain one or more of the sequences set out in TABLE 9.

TABLE 8

AHIVALSQHPAALGTVAV    SEQ ID NO: 8
RNALTGAPLN            SEQ ID NO: 9
DTGQLLKIAKRGGVTAV     SEQ ID NO: 10
AGELRGPPLQLDTGQLL     SEQ ID NO: 11
KIAKRGGVTAVEAVHA      SEQ ID NO: 12

TABLE 9

VDLRTLGYSQQQQ    SEQ ID NO: 13
VDLCTLGYSQQQQ    SEQ ID NO: 14
EALVGHGFTHAHI    SEQ ID NO: 15
SQQQQEKIKPKVR    SEQ ID NO: 16
STVAQHHEALVGH    SEQ ID NO: 17

Again, using the Hax3 amino acid sequence shown in FIG. 3A as a point of reference, specific alterations to the carboxyl flanking region of the TAL repeat include deletions up to amino acids 10, 25, 50, 65, 75, 90, 110, 115, 148, 152, 161, 175, 182, 195, 202, 250, etc. from the end of the TAL repeat region. In some embodiments, carboxyl flanking regions may be from about 10 to about 400 amino acids, from about 50 to about 400 amino acids, from about 100 to about 400 amino acids, from about 150 to about 400 amino acids, from about 152 to about 400 amino acids, from about 200 to about 400 amino acids, from about 10 to about 300 amino acids, from about 10 to about 282 amino acids, from about 10 to about 200 amino acids, from about 10 to about 100 amino acids, from about 10 to about 75 amino acids, from about 10 to about 50 amino acids, from about 50 to about 300 amino acids, from about 50 to about 200 amino acids, from about 90 to about 300 amino acids, from about 90 to about 200 amino acids, from about 100 to about 300 amino acids, from about 100 to about 200 amino acids, etc., in length.

In some instances, TAL effectors and TAL effector fusions of the invention will contain one or more of the sequences set out in TABLE 10 and/or will not contain one or more of the sequences set out in TABLE 11.

TABLE 10

ALTNDHLVALACLG      SEQ ID NO: 18
GRPALDAVKKGLPHAP    SEQ ID NO: 19
QLFRRVGVTE          SEQ ID NO: 20
NRRIPERTSH          SEQ ID NO: 21
VRVPEQRDALH         SEQ ID NO: 22

TABLE 11

ADDFPAFNEEE     SEQ ID NO: 23
LAWLMELLPQ      SEQ ID NO: 24
LHAFADSLERDL    SEQ ID NO: 25
DAPSPMHEGDQT    SEQ ID NO: 26
GTLPPASQRW      SEQ ID NO: 27

The total size of TAL effector flanking regions (i.e., amino terminal and carboxyl terminal combined) may be in the ranges of from about 50 to about 1,000 amino acids, from about 100 to about 1,000 amino acids, from about 150 to about 1,000 amino acids, from about 200 to about 1,000 amino acids, from about 300 to about 1,000 amino acids, from about 100 to about 700 amino acids, from about 100 to about 500 amino acids, from about 150 to about 800 amino acids, from about 150 to about 500 amino acids, etc. The amino flanking region and the carboxyl flanking region may be of about the same size or of different sizes. For example, either of the flanking regions may be comprised of amino acids in a ratio of from 1:1 to about 3:1, from 1:1 to about 4:1, from 1:1 to about 2:1, from 1:1 to about 5:1, from 1:1 to about 6:1, etc., as compared to the other flanking region. As an example, if a hypothetical TAL effector has a 1:3 ratio of amino acids between its flanking regions and the larger flanking region is the amino flanking region with 150 amino acids, then the carboxyl flanking region is 50 amino acids in length.
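The arithmetic of the hypothetical example above can be written out as follows. This is only a sketch; the function name and the use of exact fractions are illustrative choices.

```python
from fractions import Fraction


def smaller_flank_length(larger_len, ratio):
    """Length of the smaller flanking region when larger:smaller is ratio:1."""
    return Fraction(larger_len) / Fraction(ratio)


amino_len = 150                                      # larger (amino) flanking region
carboxyl_len = smaller_flank_length(amino_len, 3)    # 3:1 ratio, as in the example
total_len = amino_len + carboxyl_len

print(carboxyl_len, total_len)
```

For the example, the carboxyl flanking region is 50 amino acids and the combined flanking regions total 200 amino acids, which falls within the combined size ranges recited above.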

Flanking sequences may or may not be linked to polypeptide segments which have additional functional activities (e.g., heterologous functional activities) and/or elements (e.g., an affinity tag such as a V5 epitope).

TAL effectors and TAL effector fusions may also be characterized by their properties (e.g., the ability to bind nucleic acid, an enzymatic activity, etc.). Further, activities may be measured in cells or outside of cells. In addition, when activities are measured intracellularly, these activities may vary with the type of cell. The invention provides TAL effectors and TAL effector fusions with specific functional activity characteristics. In many instances, such activities and activity levels will be functional characteristics of TAL proteins and, thus, will be a feature of these TAL protein compositions of matter.

Qualitative and Quantitative TAL Binding Assays

Extracellular TAL effector binding activity may be assessed by any number of means. TAL effector binding assays may be qualitative or quantitative. In a qualitative assay, TAL effector binding activity would normally be measured as either present or absent. In quantitative and semi-quantitative assays, the amount of binding is measured. Most assays used would be quantitative to some extent because such assays better discriminate between non-specific and specific binding. Also, quantitative binding assays allow for the identification of binding molecules with specific binding affinities. Such assays also allow for comparative assessment of binding activity. Usually, a standard is used to set a baseline, with weaker binders exhibiting lower binding activity and stringent binders exhibiting higher binding activity.

One type of qualitative binding assay is set out in Example 2 and FIGS. 4A and 4C.

The in vitro binding assay is sensitive, fast, easy to perform and can be applied to any TAL protein to demonstrate TAL binding specificity. For this purpose, the TAL protein may be fused with a purification or detection tag (e.g., V5 epitope, c-myc, hemagglutinin (HA), FLAG™, polyhistidine (His), glutathione-S-transferase (GST), maltose binding protein (MBP)) and expressed, e.g., in a cell free system.

Efficient in vitro cell-free expression systems suitable for use in the assay without limitation are, e.g., an E. coli S30 fraction, a rabbit reticulocyte lysate, a wheat germ extract, a human cell extract or another expression system known in the art. A well known prokaryotic in vitro translation system is the E. coli crude extract (S30) where endogenous mRNA is removed by run-off translation and subsequent degradation. The E. coli system comprises a user friendly translation apparatus and allows for convenient control of initiation.

A commonly used eukaryotic cell-free expression system is the rabbit reticulocyte lysate. Reticulocytes are immature red blood cells specialized for haemoglobin synthesis (Hb is 90% of protein content) lacking nuclei but comprising a complete translation machinery. Endogenous globin mRNA may be removed by treatment with Ca²⁺-dependent micrococcal nuclease, which is then inactivated by EGTA-chelation of Ca²⁺. Exogenous proteins are synthesized at a rate close to that observed in intact reticulocytes. Both capped (eukaryotic) and uncapped (viral) RNA are translated efficiently in this system. Kozak consensus and polyA signal are generally provided on the RNA. This system allows for synthesis of mainly full-length products.

Another common cell-free expression system which is a convenient alternative to rabbit reticulocyte lysate is the wheat germ extract, a system with low levels of endogenous mRNA and thus, low background which allows for high level synthesis of exogenous proteins of mammalian, viral or plant origin.

The obtained protein extract containing the TAL protein is added to a solid support, such as, e.g., a coated plate or coated beads. Two different embodiments of such assay are illustrated in FIGS. 4A and 4C. In some embodiments, the TAL protein is a His-tagged protein and is captured on a Nickel-coated support. However, other suitable systems known in the art can also be used to capture tagged proteins onto solid supports. Further examples are the streptavidin-biotin system and the FlAsH system (Life Technologies Corp, Carlsbad), where proteins containing the tetracysteine motif Cys-Cys-Pro-Gly-Cys-Cys (SEQ ID NO: 106) are specifically bound by FlAsH or ReAsH arsenic reagents. In one embodiment, the TAL protein carries an N-terminal or C-terminal His-tag and is captured on Nickel-coated plates or Nickel-coated beads. The unbound protein is washed away and double stranded DNA targets (“binding probe”) containing the predicted target site are then incubated with the bound protein. In a parallel reaction unrelated control DNA may be used. The DNA incubation may occur either prior to or after the protein binding step. Following incubation with binding probes the solid support may be washed again and the complexes are further incubated with a labeling dye such as an intercalating agent. Different reagents suitable for said purpose are known in the art and may include, e.g., PICOGREEN®, YOPRO, SYBR® Green, Ethidium bromide (EthBr), EnhanCE or others. The labeled complexes may then be analyzed by measuring fluorescence or may be subject to real-time PCR analysis as, e.g., illustrated in FIG. 4B.

Thus, the invention relates, in part, to a TAL binding assay wherein the assay includes at least the following steps: (i) expression of a tagged TAL protein, (ii) binding of the TAL protein to a solid support, (iii) incubating the TAL protein with a DNA probe, (iv) incubating the complex with an intercalating fluorescent dye, and (v) detecting the bound DNA. Step (i) of the binding assay may further be performed in a cell-free expression system. In one embodiment, the binding of TAL protein in step (ii) is mediated by a protein tag such as a His-tag and the solid support may, e.g., be a Nickel-coated plate or Nickel-coated beads. In some instances, step (iii) may precede step (ii). In another embodiment, washing steps are performed after steps (ii) and/or (iii) and/or (iv). Furthermore, the invention relates to a TAL binding kit comprising at least the following components: (i) a customized TAL expression vector, (ii) a solid support for TAL protein binding, (iii) one or more buffer systems, (iv) a specific binding probe, (v) an unspecific binding probe, and (vi) an intercalating fluorescent dye.

The TAL binding kit may further comprise an extract for cell-free protein expression. Furthermore, the customized TAL expression vector may comprise a sequence encoding a protein tag (e.g., a His-tag) to allow for expression of a tagged TAL protein. In another embodiment the TAL binding kit may further comprise one or more binding buffer and/or washing buffer systems.

TAL binding assays can be used to rapidly test TAL nuclease activity in vitro using crude TAL nuclease protein mixtures expressed in a cell free system. Qualitative assays essentially test parameters, such as the mechanics of target recognition, spacing, and cleavage of a synthetic linear template. In certain instances, it may be desired to further adjust the binding assay by making it more quantitative to allow a better approximation of enzyme kinetics which supports prediction of TAL nuclease activity at specific genomic loci in cells. Assignment of specific activity of a particular TAL nuclease pair to its synthetic target, which would allow prediction of relative activity in a cellular context, may also be desirable. In an initial step, the concentration of TAL nuclease in a cell free expression mix (as described above) is quantified. This information may then be used to develop a linear standard curve of activity from which enzyme kinetic data can be generated. Several TAL nuclease pairs may be evaluated and assigned specific activity values which are then compared to locus specific cleavage in cells as measured by a mismatch repair endonuclease assay (as described in detail elsewhere herein). Combining the information obtained from both analyses allows for a clear correlation between TAL nuclease pair specific activity in vitro and locus modulation efficiency in a cell under a controlled set of conditions. Thus the invention relates in part to a quantifiable in vitro assay to predict TAL nuclease activity in vivo. In one embodiment the quantifiable assay is characterized by at least the following steps:

1. Quantification of TAL nuclease concentration from a crude cell extract. Various methods may be used for quantifying TAL nuclease proteins from crude cell extract. For example, a TAL nuclease can be expressed with an affinity tag (e.g., an N-terminal or C-terminal tag, such as, e.g., a His-tag) and rapidly purified/enriched using an affinity purification resin (e.g., Ni-NTA or similar resins). Resulting protein fractions can then be quantified via standard protein assays. Alternatively, an in situ assay can be applied where the expressed TAL nuclease contains an N-terminal FlAsH tag. By adding the FlAsH reagent, fluorescence in a particular reaction can be read against a standard curve of purified, similarly tagged protein.

2. In vitro enzymatic determination of TAL nuclease activity. Following quantification of a panel of TAL nucleases, known amounts (e.g., equimolar amounts) of the respective TAL nuclease cleavage half domains may be incubated with a fixed molar amount of target template under standard conditions (at a given time, temperature, ionic strength). From such titrations, a range of concentrations is determined which yields a linear function of cleavage activity (% template cleaved) to TAL nuclease pair concentration. Based on the obtained cleavage-concentration ratio a unit measurement (e.g., 50% cleavage of x moles template equals 1 unit) may be assigned. At a suitable linear dynamic range (for instance one to two logs concentration), TAL nucleases can be expressed, quantified, normalized, and assayed at fixed concentration to measure specific activity (units/mass).

3. Correlation of TAL nuclease specific in vitro activity with endogenous locus modification activity in vivo. The panel of TAL nuclease pairs tested in step 1. may then be rank ordered according to the specific activities measured in vitro and subsequently tested in their specific host cell lines (as described elsewhere herein in detail). With the third step, a relative correlation of the determined in vitro activity to the effective in vivo activity may be gained that allows for prediction of TAL nuclease functionality in the desired host.

A quantifiable assay according to the invention may be offered as part of a custom service by a TAL service provider. Alternatively, the assay may be offered in the context of a kit providing all reagents, protocols and analysis tools to allow expression, purification and measurements of one or more TAL nuclease pairs according to the three steps described above. Such kits may be aided by suitable programs or equations to perform required calculations (e.g., as specified under step 2.). There is a desire for quality testing of TAL nucleases prior to initiating potentially long and expensive experimental protocols in cells. Therefore, kits and assays according to the invention can help researchers to efficiently screen multiple TAL nuclease configurations to ensure their experimental protocol is based on the optimal configuration.
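The calculation outlined under step 2 can be sketched in a few lines of Python. This is a minimal illustration only: the titration values, the 20 µL reaction volume, the 100 kDa molecular weight and all function names are hypothetical and do not come from the specification. One unit is taken, as in the example given in step 2, to be the amount of TAL nuclease pair that cleaves 50% of the fixed template amount under the standard conditions.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def conc_for_cleavage(slope, intercept, pct=50.0):
    """Concentration at which the fitted line reaches `pct` % cleavage."""
    return (pct - intercept) / slope


def units_per_microgram(c50_nM, volume_uL, mol_weight_kDa):
    """Specific activity (units/ug) when 1 unit = enzyme giving 50 % cleavage."""
    fmol_per_unit = c50_nM * volume_uL           # nM * uL = fmol
    pg_per_unit = fmol_per_unit * mol_weight_kDa  # fmol * kDa = pg
    return 1.0 / (pg_per_unit * 1e-6)             # pg -> ug, then invert


# Hypothetical titration restricted to the linear range of the assay.
conc_nM = [1.0, 2.0, 4.0, 8.0, 16.0]
pct_cleaved = [5.0, 10.0, 20.0, 40.0, 80.0]

slope, intercept = fit_line(conc_nM, pct_cleaved)
c50 = conc_for_cleavage(slope, intercept)
activity = units_per_microgram(c50, volume_uL=20.0, mol_weight_kDa=100.0)
print(f"c50 = {c50:.2f} nM, specific activity = {activity:.1f} units/ug")
```

Only data points that fall within the linear dynamic range should be passed to the fit; points near saturation would bias the slope and therefore the assigned unit.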

Another binding assay is a sandwich assay employing a solid support with nucleic acid molecules with sequences recognized by a TAL effector. The TAL effector is then contacted with the solid support under conditions which allow for binding. After an incubation period, unbound TAL effector molecules are removed and the bound TAL effectors are quantified with a labeled anti-TAL effector antibody.

Another type of assay is referred to as a mobility shift DNA-binding assay (see, e.g., FIG. 4E). In one variation of this DNA-binding assay, nondenaturing polyacrylamide gel electrophoresis (PAGE) is used to provide simple, rapid, and sensitive detection of sequence-specific DNA-binding proteins.

For example, proteins that bind specifically to a labeled (e.g., end-labeled, nick translation labeled, etc.) DNA fragment retard the mobility of the fragment when the DNA fragment is subjected to electrophoresis. This results in discrete bands corresponding to the individual protein-DNA complexes and unbound DNA fragments. One advantage of this assay is that either purified proteins or extracts may be used. Also, data derived from such assays may be used to make quantitative determinations of the (1) affinity, (2) DNA binding protein concentration, (3) association rate constants, (4) dissociation rate constants, and (5) binding specificity of the subject DNA-binding proteins. Further, banding patterns may be used to identify bands which contain two TAL effectors bound to each nucleic acid molecule. This is so because such nucleic acid molecules will be retarded during PAGE more than nucleic acid molecules which are not bound by a TAL effector and nucleic acid molecules to which only one TAL effector is bound. TAL effectors which function as nucleases will often have functional activity upon dimerization of nuclease domains. Mobility shift assays allow for the measurement of TAL effectors with binding activities that allow for dimer formation.

Even protein-DNA complexes with short half-lives (<1 minute) are normally detected by mobility shift assays despite the fact that electrophoresis takes significant amounts of time. This is so because kinetic stability is typically not required for detection of protein-DNA complexes. Further, the sensitivity of these assays is often in the femtomole range.

The invention further relates to another assay format that is suitable to confirm specific DNA binding of customized TAL effector proteins in vitro. The suggested system can be used in a high throughput setting and can be performed as a “one-pot” reaction including in vitro transcription and/or translation of a given TAL effector protein followed by on-line detection of TAL DNA binding. One embodiment of this assay is illustrated in FIG. 4F. The open reading frame encoding a TAL effector that is to be tested may, e.g., be provided in a plasmid or as a PCR fragment flanked by a promoter (e.g., a T7 promoter) or as an RNA molecule and may be subject to in vitro transcription and/or in vitro translation, e.g., according to one of the methods described above (e.g., using wheat germ or E. coli lysate).

The in vitro translated TAL effector protein is incubated with a pair of oligonucleotides (sense & antisense) harboring a specific TAL binding site (see FIG. 4F). At least one of the two oligonucleotides is designed to contain terminal sequences that are not required for TAL effector binding and are able to hybridize and form an intramolecular stem-loop structure. The TAL binding site may comprise between 4 and 10, between 8 and 15, between 12 and 20, between 15 and 26, or between 20 and 30 nucleotides. For example, the TAL binding site may comprise 19 or 25 nucleotides. In some aspects the TAL binding site may start with a “T”. In some instances the terminal sequences not required for TAL effector binding may comprise between 4 and 7, between 5 and 9, between 8 and 15, or between 10 and 20 nucleotides. An optimal length of the terminal sequences may be determined depending on the length and/or composition of the TAL binding site, e.g., by computer-assisted means. In one embodiment, one end of the first oligonucleotide (e.g., sense) may be attached to a reporter fluorophore (e.g., FAM), whereas the other end may be attached to a non-fluorescent quencher moiety (e.g., BHQ-1).
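The oligonucleotide layout described above can be illustrated with a short Python sketch. It is an assumption-laden example rather than a prescribed design: the 19-nucleotide binding site, the 6-nucleotide stem arm and the function names are hypothetical, and the antisense oligonucleotide is assumed here to cover only the binding site. The sense oligonucleotide is built as stem arm, binding site, and the reverse complement of the stem arm, so that its two ends can pair intramolecularly.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def design_beacon_pair(binding_site, stem_arm="GCGAGC"):
    """Build a (sense, antisense) pair for a stem-loop binding reporter.

    sense     = stem_arm + binding_site + reverse_complement(stem_arm)
    antisense = reverse_complement(binding_site)   (assumption of this sketch)
    """
    site = binding_site.upper()
    arm = stem_arm.upper()
    if set(site + arm) - set("ACGT"):
        raise ValueError("sequences may contain only A, C, G and T")
    if not 4 <= len(site) <= 30:
        raise ValueError("binding site length outside the 4-30 nt range")
    if not 4 <= len(arm) <= 20:
        raise ValueError("stem arm length outside the 4-20 nt range")
    sense = arm + site + reverse_complement(arm)
    antisense = reverse_complement(site)
    return sense, antisense


# Hypothetical 19-nt binding site starting with "T".
sense, antisense = design_beacon_pair("TGACCTAGCATCGATCGAA")
print("sense    :", sense)       # fluorophore on one end, quencher on the other
print("antisense:", antisense)
```

In practice the stem length would be tuned to the length and composition of the binding site, for example by checking predicted melting temperatures with a dedicated tool.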

Any fluorophore labels known in the art can be used in the invention and may be chosen according to their excitation and emission spectra. Suitable fluorophores include without limitation FAM, TET, CAL Fluor Gold 540, HEX, JOE, VIC, CAL Fluor Orange 560, Cy3, NED, Quasar 570, Oyster 556, TMR, CAL Fluor Red 590, ROX, LC red 610, CAL Fluor Red 610, Texas red, LC red 640, CAL Fluor Red 635, Cy5, LC red 670, Quasar 670, Oyster 645, LC red 705, Cy5.5, etc. For contact quenching, any non-fluorescent quencher can serve as acceptor of energy from the fluorophore. Quencher molecules that can be used in the invention include without limitation DDQ-I, Dabcyl, Eclipse, Iowa Black FQ, BHQ-1, QSY-7, BHQ-2, DDQ-II, Iowa Black RQ, QSY-21, BHQ-3, etc.

When the oligonucleotide forms a stem-loop, the fluorophore and quencher moiety are brought into close proximity, allowing energy from the fluorophore to be transferred directly to the quencher through contact quenching. This molecular beacon will initially be in equilibrium between its closed stem-loop conformation, which allows for quenching of the signal, and an open state where the stem-loop structure dissociates, thereby separating the fluorophore and the quencher from each other. In the open conformation, sense and antisense oligonucleotides hybridize to form a double strand structure that allows for signaling of the free fluorophore. With an increasing amount of TAL effector protein binding to the oligonucleotide pair, the open state conformation will be stabilized and dominate in the population, which leads to a measurable signal increase over time.

Thus in one aspect, the invention relates to an assay for analysis of TAL effector binding wherein the assay contains at least (i) a TAL effector protein, (ii) a first oligonucleotide that contains a TAL binding site and terminal sequences capable of forming a stem-loop structure, wherein one end of the oligonucleotide is associated with a fluorophore molecule and the other end of the oligonucleotide is associated with a quenching molecule, (iii) a second oligonucleotide with a sequence that is capable of annealing to said first oligonucleotide, wherein a measurable signal is obtained when at least a portion of the first and second oligonucleotides are annealed, and wherein binding of the TAL effector protein to the TAL binding site favors annealing of the first and second oligonucleotides.

As a negative control, a parallel binding reaction with an unrelated pair of oligonucleotides may be performed. As the signal strength depends on the ratio of oligonucleotides present in a stem-loop or open conformation, the assay allows for quantitative evaluation of TAL effector binding. The method of the invention may also be performed with the following variations: In a first alternative embodiment, a quenching effect may be achieved in the open conformation when a fluorophore is attached, e.g., to the 3′ end of the sense oligonucleotide and the quencher is attached to the 5′ end of the antisense oligonucleotide (or vice versa). In this case TAL effector binding would lead to a decreasing signal. In another alternative embodiment, fluorescence resonance energy transfer (FRET) can be used to track oligonucleotide conformation. FRET is a distance-dependent interaction between the electronic excited states of two dye molecules in which excitation is transferred from a donor molecule to an acceptor molecule without emission of a photon. This interaction only occurs when donor and acceptor molecules are in close proximity.
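The distance dependence of FRET is commonly expressed through the Förster relation. The symbols below (transfer efficiency E, donor-acceptor distance r, and Förster distance R_0 at which the efficiency is 50%) are introduced here for illustration and are not used elsewhere in this document.

```latex
E = \frac{1}{1 + \left( \dfrac{r}{R_0} \right)^{6}}
```

Because of the sixth-power dependence, the efficiency falls from about 98% at r = R_0/2 to about 1.5% at r = 2R_0, which is why the closed and open conformations of the oligonucleotides give clearly distinguishable signals.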

Thus, the invention also relates to an assay for analysis of TAL effector binding wherein the assay contains at least (i) a TAL effector protein, (ii) a first oligonucleotide that contains a TAL binding site and terminal sequences capable of forming a stem-loop structure, wherein one terminal end of the oligonucleotide is associated with a first FRET molecule (donor or acceptor), (iii) a second oligonucleotide with a sequence that is capable of annealing to said first oligonucleotide, wherein one terminal end of the second oligonucleotide is associated with a second FRET molecule (donor if the first FRET molecule is an acceptor and acceptor if the first FRET molecule is a donor), wherein a measurable FRET signal is obtained when at least a portion of the first and second oligonucleotides are annealed, and wherein binding of the TAL effector protein to the TAL binding site favors annealing of the first and second oligonucleotides. In an alternative embodiment, FRET acceptor and donor molecules can be attached to both ends of one oligonucleotide. In this case TAL effector binding would lead to a decreasing FRET signal. For the design of annealing fluorescent oligonucleotides using FRET, fluorophore-quencher pairs that have sufficient spectral overlap should be chosen. Different donor/acceptor pairs known in the art can be used in this assay including, e.g., Fluorescein/Tetramethylrhodamine, IAEDANS/Fluorescein, EDANS/Dabcyl, Fluorescein/Fluorescein, BODIPY FL/BODIPY FL, Fluorescein/QSY 7 and QSY 9 dyes, etc. In most applications, the donor and acceptor dyes are different, in which case FRET can be detected by the appearance of sensitized fluorescence of the acceptor or by quenching of donor fluorescence. When the donor and acceptor are the same, FRET can be detected by the resulting fluorescence depolarization.

In certain instances, in vitro translation of the TAL effector protein may be observed in real time by using fluorescent reagents that are capable of interacting with the translated protein and that change their fluorescent properties upon binding. Such fluorescent reagents may, e.g., include small molecules interacting with a protein tag (such as, e.g., a His-tag), fluorescently labeled aptamers, or fluorophore/quencher or FRET systems coupled to antibodies, single chain antibodies, aptamers or anticalins which may bind to conserved domains of the TAL effector proteins. For example, the reagents may be designed to bind pairwise to adjacent loops in the TAL repeat domain, which may lead to quenching/FRET signaling, thereby changing the signal obtained in the unbound state.

Additional methods suitable for use in the practice of the invention for detecting the sequence-specific binding of proteins to nucleic acids include nitrocellulose filter binding, DNaseI footprinting, methylation protection, and methylation interference.

In many instances it will be desirable to employ in vivo assays of TAL function. This will likely be so when, for example, one wishes to use a TAL effector or TAL effector fusion in a particular cell. In vivo assays may fall into two categories based upon either inhibition or activation.

Inhibition assays are useful for, for example, detecting intracellular TAL effector or TAL effector fusion binding activity. An inhibition assay may be designed in which a TAL effector binding site is located, for example, between a promoter and a reporter gene. The reporter may be regulatable or constitutive, and TAL effector binding activity may be measured by the suppression of transcription (e.g., suppression of reporter protein or mRNA production). Further, differential measurement of transcriptional suppression may be used to assay the TAL binding strength (e.g., affinity of a TAL effector for a specific nucleotide sequence). Thus, the invention includes, in part, methods for screening the binding activity of TAL effectors, these methods comprising the following:

(a) generating nucleic acid molecules encoding a population of TAL effectors or TAL effector fusions with identical TAL repeats but differing in the amino flanking region and/or the carboxyl flanking region;
(b) introducing the nucleic acid molecules into cells (e.g., a mammalian cell such as 293, HeLa, CHO, etc., cells) containing a TAL effector binding site located between a promoter and a gene (e.g., a reporter gene) under conditions suitable for expression of the encoded TAL effectors or TAL effector fusions; and
(c) comparing cellular expression levels of the gene either (i) in the same cells before or after TAL effector expression or (ii) in different cells which express the TAL effectors and do not express the TAL effectors.

An activation assay is one in which an activity of TAL effectors or TAL effector fusions other than nucleic acid binding activity is measured. One example is where nucleic acid molecules encoding a population of TAL effector fusions, wherein the fusion partner is a transcriptional activator (e.g., VP16, VP64, etc.), are generated and screened to determine the transcriptional activation activity of population members. Thus, the invention includes, in part, methods for screening TAL effector fusions for transcriptional activation activity, these methods comprising:

(a) generating nucleic acid molecules encoding a population of TAL effector fusions with identical TAL repeats but differing in the amino flanking region and/or the carboxyl flanking region;
(b) introducing the nucleic acid molecules into cells (e.g., a mammalian cell such as 293, HeLa, CHO, etc., cells) containing a TAL effector binding site located between a promoter and a gene (e.g., a reporter gene) under conditions suitable for expression of the TAL effector fusions; and
(c) comparing cellular expression levels of the gene either (i) in the same cells before or after TAL effector fusion expression or (ii) in different cells which express the TAL effector fusions and do not express the TAL effector fusions.

The population of TAL effector fusions may, e.g., be TAL effector fusions with modified or truncated N- and/or C-terminal flanking regions. Fully assembled TAL effector proteins comprising at least a central repeat domain and an amino- and carboxyl-terminal domain may comprise more than 800 amino acid residues. In some instances it may be beneficial to identify the minimal terminal ends required for TAL effector binding in order to reduce the size of engineered TAL effectors or large TAL effector fusions. TAL effectors with truncated N- and/or C-terminal ends have been demonstrated to be functional in the context of fusion proteins including the truncated TAL effector nucleases described in this document. One strategy to identify minimal functional N- and C-terminal domains of TAL effectors is described in Zhang et al. (“Efficient construction of sequence-specific TAL effectors for modulating mammalian transcription”. Nat. Biotechnol. 2011 February; 29(2):149-53.). The authors used a program to predict the secondary structure of TAL N- and C-termini and introduced truncations at predicted loop regions. However, the development of novel engineered TAL effectors with tailored TAL repeats for any given DNA target sequence may require a more systematic approach to identify minimal and/or optimal N-terminal and C-terminal domains which support TAL binding activity.

In one aspect the invention includes a strategy to identify functional TAL truncations from a truncation library. A library that contains all possible combinations of TAL N- and C-terminal truncations flanking a given central repeat domain can be obtained by a method comprising at least the following steps: (1) generating a series of A-fragments each encoding at least part of a TAL N-terminus and a 5′ moiety of a TAL repeat domain; (2) generating a series of B-fragments each encoding at least part of a TAL C-terminus and a 3′ moiety of a TAL repeat domain; (3) cleaving the plurality of A-fragments and B-fragments to obtain compatible overhangs that allow for (i) combination of any A-fragment with any B-fragment and (ii) directed insertion of the resulting combinations of A- and B-fragments into a target vector; and (4) ligating the combinations of A- and B-fragments into said target vector to obtain a vector library. Optionally, the method may further comprise inserting the vector library into a host cell (see FIG. 5).

The series of A-fragments and B-fragments representing step-wise truncations of TAL N- and C-termini may be generated either by de novo gene synthesis as described elsewhere herein or may be obtained by template-dependent PCR. For example, truncations of the N-terminus may be introduced using a series of primer pairs wherein a forward primer binds inside the N-terminus encoding region and a reverse primer binds within the central repeat domain coding region of a TAL effector template DNA. Step-wise truncations may occur amino acid-wise (every primer is shifted by one codon) or may be performed in larger steps (e.g., each primer is shifted by 5 to 10 or more amino acids). In one embodiment the A- and B-fragments are designed to contain type II or type IIS cleavage sites at the 5′ and 3′ ends. In one embodiment, the 3′ ends of the A-fragments and the 5′ ends of the B-fragments contain type IIS cleavage sites whereas the 5′ ends of the A-fragments and the 3′ ends of the B-fragments may contain either type II or type IIS cleavage sites. For example, when A- and B-fragments are generated by PCR, the cleavage sites may be introduced via terminal amplification primers. In some embodiments, the overhangs resulting from type IIS cleavage at the 3′ ends of the A-fragments may be compatible with the overhangs resulting from type IIS cleavage at the 5′ ends of the B-fragments but may not be compatible with the overhangs resulting from cleavage at the 5′ ends of the A-fragments. Likewise, the overhangs resulting from type IIS cleavage at the 5′ ends of the B-fragments may be compatible with the overhangs resulting from type IIS cleavage at the 3′ ends of the A-fragments but may not be compatible with the overhangs resulting from cleavage at the 3′ ends of the B-fragments. This strategy may be used to avoid combinations of an A-fragment with another A-fragment or of a B-fragment with another B-fragment, thereby excluding nonsense combinations from the library.
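The combinatorial structure of such a truncation library can be sketched as follows. The terminus lengths, the step size and the function names are hypothetical values chosen for illustration; the sketch only enumerates which N-terminal length is paired with which C-terminal length and does not model primers or overhangs.

```python
from itertools import product


def truncation_lengths(full_length, step, min_length=0):
    """Retained terminus lengths (in amino acids) after step-wise truncation."""
    return list(range(full_length, min_length - 1, -step))


def build_library(n_term_length, c_term_length, step=10, min_length=0):
    """Pair every A-fragment (N-terminal variant) with every B-fragment
    (C-terminal variant); A-A and B-B pairings are never generated."""
    a_fragments = truncation_lengths(n_term_length, step, min_length)
    b_fragments = truncation_lengths(c_term_length, step, min_length)
    return list(product(a_fragments, b_fragments))


# Hypothetical example: 50-aa N-terminus, 30-aa C-terminus, 10-aa steps.
library = build_library(50, 30, step=10)
print(len(library), "length variants")
print(library[:3])
```

With a step of one codon the library size grows as the product of the two terminus lengths, which is one reason larger steps of 5 to 10 amino acids may be preferred for a first screen.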

The obtained library of A-fragments with B-fragments, collectively referred to as “length variants”, may be inserted into a target vector (e.g., a functional vector) under control of a promoter region that allows for expression in the target host. The vector may be designed to provide a coding sequence for a TAL effector fusion downstream of the inserted length variants so that a library of TAL effector fusion proteins is expressed. The fusion domain may, e.g., be an activator domain, a repressor domain, a nuclease domain or any other suitable domain. Furthermore, the vector may contain a reporter gene cassette in proximity to one or more TAL binding sites that can be bound by functional length variants of the TAL effector fusion proteins (see FIG. 5).

Thus the invention also relates to a vector containing at least the following elements: (i) a TAL effector insertion site for insertion of a TAL effector sequence flanked by type II or IIS cleavage sites, (ii) a promoter region upstream of the insertion site, (iii) a sequence encoding a TAL effector fusion domain downstream of the TAL effector insertion site, (iv) at least one selection marker, (v) one or more insertion sites for one or more copies of a TAL binding site flanked by type II or IIS cleavage sites, (vi) a reporter gene cassette composed of at least a promoter region and a reporter open reading frame, and, optionally, (vii) at least one primer binding site flanking the TAL effector insertion site. One or more copies of TAL binding sites may, e.g., be provided in the form of annealed oligonucleotides designed to have terminal overlaps that are compatible with the overhangs generated by type II or type IIS cleavage in the target vector. The aforementioned vector is not limited to the testing of truncated TAL effector proteins but can be used as a binding reporter system, e.g., in a high throughput setting to validate binding of engineered TAL effector proteins to a predicted binding site in vivo. As reporter gene, any gene may be used that allows for identification of cells harboring a functional TAL effector protein. Reporter genes that may be used in the vector system include without limitation gfp, rfp, luciferase or a resistance marker gene suitable for a given host as described elsewhere in this document.

In an alternative embodiment, TAL effector fusion variants may be provided in a first vector and the reporter gene expression cassette and the TAL binding site(s) may be provided in a second vector. Furthermore, the invention relates to a library of vectors wherein each vector carries a truncated variant of a TAL effector sequence.

The library of vectors containing the different length variants may be inserted into host cells to allow for expression of the TAL effector fusion proteins. Thus, in one aspect the invention also relates to a host cell library and the use thereof to identify functional TAL effector truncations. Suitable methods for host transformation or transfection are described elsewhere herein. In certain instances it may be desirable to stably transfect host cells with the library, e.g., by using integration systems that provide recombinases such as, e.g., Cre, Flp or PhiC31 to integrate one copy of a vector in a defined genomic region (see, e.g., FLP-IN™ or JUMP-IN™ from Life Technologies (Carlsbad, Calif.)). In certain instances it may be advantageous to use an inducible promoter for regulated TAL effector expression. Following expression of the TAL effector fusion library, only the functional truncated variants will bind to the target binding sites and modulate the expression of the reporter gene by the activity of the fusion domain. Detection of reporter gene activity can be used to identify host cells carrying a functional TAL effector fusion protein. In one embodiment the reporter gene may be a fluorescent gene such as, e.g., gfp or rfp and the effector fusion domain may be an activator domain (e.g., VP64).

The invention thus provides methods for integrating expression constructs (e.g., nucleic acid molecules which encode TAL effectors) into the genome of a cell. As noted above, integration systems which employ recombinases may be used in such methods. Other methods (e.g., homologous recombination) may also be used.

Genome integration may be random or site specific. When random integration is employed, cells that have potentially integrated nucleic acid into their genome may be screened for nucleic acid expression (e.g., selectable marker expression, mRNA levels, etc.). Thus, insert expression levels may be used to identify cells which have incorporated nucleic acid in a region of the genome which allows for suitable expression levels (e.g., regions of open chromatin structure in eukaryotic cells).

When site specific integration is used with the goal of expression of inserted nucleic acid, then generally it will be desirable to insert the nucleic acid in a site which allows for expression. Further, such sites vary with factors such as, for example, the organism, the cell type, and stage of development. Examples of site-specific integration sites which may be used include the human PPP1R12C (e.g., in the intronic region between exons 1 and 2), AAVS1 and CCR5 (e.g., in the region overlapping the intron between exons 2 and 3 and the exon 3 coding region) loci as described in more detail elsewhere herein.

The invention also provides cell lines which are designed for site specific integration of exogenous nucleic acid into their genomes. These cell lines may contain one or more recombination or pseudo recombination sites in their genome. Typically, such sites will be selected or structured in such a manner as to allow for insertion of nucleic acids (see, e.g., Chesnut et al., U.S. Patent Publication No. 2008/0216185 A1, the disclosure of which is incorporated herein by reference).

In some embodiments, recombination sites will be introduced into the genome. Such introduction may be site specific or random. Further, cells which have randomly acquired recombination sites may then be screened to determine whether one or more recombination sites have been introduced in a location suitable for a particular purpose (e.g., transcription of a coding sequence integrated at the locus). Site specific recombination sites may be introduced specifically at locations known to be suitable for a particular purpose.

In additional embodiments, where integration of nucleic acid into specific regions of a genome is desired, sites with functional homology to site-specific recombination sites (pseudo recombination sites) can be identified and used. These sites may be used to target the insertion of nucleic acids to a desired region. Pseudo recombination sites which may be used for this purpose include, but are not limited to, those recognized by the recombinases phiC31, R4, phi80, P22, P2, 186, P4 and P1. A large number of genomes have been sequenced. These sequence data may be searched to identify pseudo recombination sites and determine whether they are potentially suitable for a particular purpose. Thus, the invention includes bioinformatic screening to identify pseudo recombination sites for site specific integration of nucleic acids into genomes.

In another embodiment the reporter gene may be a resistance gene and cells carrying a functional TAL effector fusion will survive under selective pressure. When a resistance marker gene is used as reporter, it may be possible to select for better binders by increasing the selective pressure. Upon binding of a functional truncated TAL effector variant to the TAL binding site, reporter expression (e.g., GFP) would be induced by binding of the activator domain to the upstream promoter region. The resulting green cells carrying functional truncated TAL effector variants can be easily selected and the truncated sequence can be identified by sequencing via flanking primer binding sites. At least one control vector should be included in the screening (e.g., containing a full-length or tested truncated variant) to ensure binding of the repeat domain to the predicted TAL binding site(s).

Reporter constructs suitable for use with the invention are described elsewhere herein. Further, such reporters may be used to isolate cells by, for example, fluorescence activated cell sorting (FACS) based upon expression activation level. In addition, nucleic acid molecules encoding TAL effectors and TAL effector fusions may then be isolated from cells after the cells have been screened for TAL activity. Thus, the methods include screening methods for identifying functional activities of TAL effectors and TAL effector fusions and isolating nucleic acid molecules which encode these proteins. In particular, these methods include isolation methods which allow for the isolation of individual nucleic acid molecules which have been shown to encode proteins having specific functional activities and/or specific levels of a particular functional activity.

TAL nucleases. TAL effectors may for example be fused with sequences encoding nuclease activities. For example, the TAL effector fusion-encoding nucleic acid sequences are sequences encoding a nuclease or a portion of a nuclease, typically a non-specific cleavage domain from a type IIS restriction nuclease such as FokI (Kim et al. (1996) Proc. Natl. Acad. Sci. USA 93:1156-1160). The FokI endonuclease was first isolated from the bacterium Flavobacterium okeanokoites. This type IIS nuclease has two separate domains, an N-terminal DNA binding domain and a C-terminal DNA cleavage domain. The DNA binding domain functions for recognition of a non-palindromic sequence 5′-GGATG-3′/5′-CATCC-3′ while the catalytic domain cleaves double-stranded DNA non-specifically at a fixed distance of 9 and 13 nucleotides downstream of the recognition site. FokI exists as an inactive monomer in solution and becomes an active dimer following binding to its target DNA and in the presence of some divalent metals. As a functional complex, two molecules of FokI, each binding to a double stranded DNA molecule, dimerize through the DNA catalytic domain for the effective cleavage of DNA double strands. Thus, as noted below, TAL effector fusions employing enzymes such as FokI will typically be introduced into cells and expressed as pairs. In many instances, these pairs will bind different nucleotide sequences, spaced in a manner to allow for dimerization of the FokI fusion components.
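The cleavage geometry of native FokI described above (recognition of 5′-GGATG-3′ and cleavage 9 and 13 nucleotides downstream) can be made concrete with a small sketch. The function name, the tuple layout and the test sequences are hypothetical; the sketch assumes the usual convention that the 9-nucleotide offset applies to the strand carrying GGATG and the 13-nucleotide offset to the complementary strand. It describes the native enzyme only; in a TAL-FokI fusion the cleavage position is determined by the TAL binding sites rather than by GGATG.

```python
def foki_cut_sites(top_strand):
    """Locate native FokI sites and report cut positions.

    Returns tuples (orientation, site_start, top_cut, bottom_cut). Positions
    are 0-based in top-strand coordinates; a cut at index k falls between
    bases k-1 and k. Sites whose cuts would fall outside the sequence are
    skipped.
    """
    seq = top_strand.upper()
    n = len(seq)
    cuts = []

    # GGATG on the top strand: cuts lie to the right of the site.
    start = seq.find("GGATG")
    while start != -1:
        end = start + 5
        top_cut, bottom_cut = end + 9, end + 13
        if bottom_cut <= n:
            cuts.append(("+", start, top_cut, bottom_cut))
        start = seq.find("GGATG", start + 1)

    # GGATG on the bottom strand reads CATCC on top: geometry is mirrored.
    start = seq.find("CATCC")
    while start != -1:
        bottom_cut, top_cut = start - 9, start - 13
        if top_cut >= 0:
            cuts.append(("-", start, top_cut, bottom_cut))
        start = seq.find("CATCC", start + 1)

    return cuts


# Hypothetical test sequences.
print(foki_cut_sites("AAGGATG" + "A" * 20))
print(foki_cut_sites("T" * 20 + "CATCC" + "TT"))
```

The 4-nucleotide difference between the two cut positions corresponds to the 5′ overhang left by the enzyme.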

Other useful nucleases may include, for example, HhaI, HindIII, NotI, BbvCI, EcoRI, BglI, and AlwI. The fact that some nucleases (e.g., FokI) only function as dimers can be capitalized upon to enhance the target specificity of the TAL effector. For example, in some cases each FokI monomer can be fused to a TAL effector sequence that recognizes a different DNA target sequence, and only when the two recognition sites are in close proximity do the inactive monomers dimerize to create a functional enzyme. By requiring DNA binding to activate the nuclease, a highly site-specific restriction enzyme can be created. A sequence-specific TAL effector nuclease can recognize a particular sequence within a preselected target nucleotide sequence present in a host. Thus, in some embodiments, a target nucleotide sequence can be scanned for nuclease recognition sites, and a particular nuclease can be selected based on the target sequence. In other cases, a TAL effector nuclease can be engineered to target a particular cellular sequence. A nucleotide sequence encoding the desired TAL effector nuclease can be inserted into any suitable expression vector, and can be linked to one or more expression control sequences. For example, a nuclease coding sequence can be operably linked to a promoter sequence that will lead to constitutive expression of the nuclease in the species of plant to be transformed. Alternatively, a nuclease coding sequence can be operably linked to a promoter sequence that will lead to conditional expression (e.g., expression under certain nutritional conditions).

The cleavage domain portion of the fusion proteins disclosed herein can be obtained from any endo- or exonuclease. Exemplary endonucleases from which a cleavage domain can be derived include, but are not limited to, restriction endonucleases and homing endonucleases. See, for example, 2002-2003 Catalogue, New England Biolabs, Beverly, Mass.; and Belfort et al. (1997) Nucleic Acids Res. 25:3379-3388. Additional enzymes which cleave DNA are known (e.g., S1 Nuclease; mung bean nuclease; pancreatic DNase I; micrococcal nuclease; yeast HO endonuclease; see also Linn et al. (eds.) Nucleases, Cold Spring Harbor Laboratory Press, 1993). One or more of these enzymes (or functional fragments thereof) can be used as a source of cleavage domains.

Restriction endonucleases (restriction enzymes) are present in many species and are capable of sequence-specific binding to DNA (at a recognition site), and cleaving DNA at or near the site of binding. Certain restriction enzymes (e.g., Type IIS) cleave DNA at sites removed from the recognition site and have separable binding and cleavage domains. For example, the Type IIS enzyme FokI catalyzes double-stranded cleavage of DNA, at 9 nucleotides from its recognition site on one strand and 13 nucleotides from its recognition site on the other. See, for example, U.S. Pat. No. 5,487,994; as well as Li et al. (“Functional domains in Fok I restriction endonuclease”) Proc. Natl. Acad. Sci. USA 89:4275-4279. Thus, in one embodiment, fusion proteins comprise the cleavage domain (or cleavage half-domain) from at least one Type IIS restriction enzyme.

Accordingly, for the purposes of the present disclosure, the portion of the FokI enzyme used in the disclosed fusion proteins is considered a cleavage half-domain. A cleavage domain or cleavage half-domain can be any portion of a protein that retains cleavage activity, or that retains the ability to multimerize (e.g., dimerize) to form a functional cleavage domain. Thus, for targeted double-stranded cleavage and/or targeted replacement of cellular sequences using TAL-FokI fusions, two fusion proteins, each comprising a FokI cleavage half-domain, can be used to reconstitute a catalytically active cleavage domain.

Multiple parameters may influence the catalytic activity of nuclease fusion proteins such as TAL effector FokI fusions.

For purposes of amino acid sequence reference, the FokI amino acid sequence found in GenBank accession number AAA24934 is used herein and set out below (SEQ ID NO: 69):

  1 MVSKIRTFGW VQNPGKFENL KRVVQVFDRN SKVHNEVKNI KIPTLVKESK IQKELVAIMN
 61 QHDLIYTYKE LVGTGTSIRS EAPCDAIIQA TIADQGNKKG YIDNWSSDGF LRWAHALGFI
121 EYINKSDSFV ITDVGLAYSK SADGSAIEKE ILIEAISSYP PAIRILTLLE DGQHLTKFDL
181 GKNLGFSGES GFTSLPEGIL LDTLANAMPK DKGEIRNNWE GSSDKYARMI GGWLDKLGLV
241 KQGKKEFIIP TLGKPDNKEF ISHAFKITGE GLKVLRRAKG STKFTRVPKR VYWEMLATNL
301 TDKEYVRTRR ALILEILIKA GSLKIEQIQD NLKKLGFDEV IETIENDIKG LINTGIFIEI
361 KGRFYQLKDH ILQFVIPNRG VTKQLVKSEL EEKKSELRHK LKYVPHEYIE LIEIARNSTQ
421 DRILEMKVME FFMKVYGYRG KHLGGSRKPD GAIYTVGSPI DYGVIVDTKA YSGGYNLPIG
481 QADEMQRYVE ENQTRNKHIN PNEWWKVYPS SVTEFKFLFV SGHFKGNYKA QLTRLNHITN
541 CNGAVLSVEE LLIGGEMIKA GTLTLEEVRR KFNNGEINF

FokI nuclease cleavage domains with increased cleavage activity comprising two amino acid substitutions, S418P and K441E, and referred to as “Sharkey” were generated employing a directed evolution strategy as described in Guo et al., (2010) (“Directed Evolution of an Enhanced and Highly Efficient FokI Cleavage Domain for Zinc Finger Nucleases”; Journal of Molecular Biology 400 (1): 96) and U.S. Pat. No. 8,034,598, the disclosure of which is incorporated herein by reference. Other mutations were shown to improve dimer enzyme specificity or enzyme activity either alone or in combination. Some of the mutations resulting in modified FokI cleavage domain activity are, without limitation: KKR (E490K, I538K, H537R), ELD (Q486E, I499L, N496D), DD (R487D, N496D) and RR (D483R, H537R). Thus the methods and compositions disclosed herein also relate in part to TAL effector fusions comprising an engineered FokI cleavage half-domain, wherein the engineered cleavage half-domain comprises a mutation in one or more wild-type amino acid residues 483, 486, 487, 490, 496, 499, 537, 538, or combinations thereof, and wherein the engineered cleavage half-domain forms an obligate heterodimer with a wild-type cleavage half-domain or a second engineered cleavage half-domain.
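The substitution notation used above (wild-type residue, position, new residue) can be checked mechanically against the reference sequence. The following is a minimal sketch, assuming 1-based numbering as in SEQ ID NO: 69; the function name `apply_substitutions` and the choice to carry only residues 411-540 of the sequence are illustrative assumptions, not part of the disclosure.

```python
import re

# Residues 411-540 of the FokI reference sequence (SEQ ID NO: 69), copied from
# the listing above. Carrying only this fragment is an illustrative shortcut.
FRAGMENT_START = 411
FOKI_411_540 = (
    "LIEIARNSTQ"
    "DRILEMKVMEFFMKVYGYRGKHLGGSRKPDGAIYTVGSPIDYGVIVDTKAYSGGYNLPIG"
    "QADEMQRYVEENQTRNKHINPNEWWKVYPSSVTEFKFLFVSGHFKGNYKAQLTRLNHITN"
)


def apply_substitutions(fragment, start, substitutions):
    """Apply substitutions such as 'S418P' to a fragment whose first residue
    has 1-based position `start`; fail if the wild-type residue does not match."""
    residues = list(fragment)
    for sub in substitutions:
        match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", sub)
        if match is None:
            raise ValueError(f"unrecognized substitution: {sub}")
        wild_type, position, new = match.group(1), int(match.group(2)), match.group(3)
        index = position - start
        if not 0 <= index < len(residues):
            raise ValueError(f"{sub}: position lies outside the fragment")
        if residues[index] != wild_type:
            raise ValueError(f"{sub}: found {residues[index]} at position {position}")
        residues[index] = new
    return "".join(residues)


# The two "Sharkey" substitutions named in the text.
sharkey = apply_substitutions(FOKI_411_540, FRAGMENT_START, ["S418P", "K441E"])
# The KKR substitutions named in the text.
kkr = apply_substitutions(FOKI_411_540, FRAGMENT_START, ["E490K", "I538K", "H537R"])
```

Because the function refuses a substitution whose stated wild-type residue is absent at the stated position, it doubles as a consistency check between a mutation list and the numbering of the reference sequence.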

Furthermore, the invention relates in part to TAL effector nuclease fusion proteins and optimized sequences encoding such proteins. In particular, the invention includes TAL effectors with codon-optimized nuclease sequences or nuclease cleavage domains such as those encoded by SEQ ID NOs: 1 to 3 (see FIG. 6A).

Additional restriction enzymes also contain separable binding and cleavage domains, and these are contemplated by the present disclosure. See, for example, Roberts et al. (2003) Nucleic Acids Res. 31:418-420. Examples of restriction enzymes suitable for use with the invention include the following, many of which are Type IIS enzymes: AarI, BsrBI, SspD5I, AceIII, BsrDI, Sth132I, AciI, BstF5I, StsI, AloI, BtrI, TspDTI, BaeI, BtsI, TspGWI, Bbr7I, CdiI, Tth111II, BbvI, CjePI, UbaPI, BbvII, DrdII, BsaI, BbvCI, EciI, BsmBI, BccI, Eco31I, Bce83I, Eco57I, BceAI, Eco57MI, BcefI, Esp3I, BcgI, FauI, BciVI, FinI, BfiI, FokI, BinI, GdiII, BmgI, GsuI, Bpu10I, HgaI, BsaXI, Hin4II, BsbI, HphI, BscAI, Ksp632I, BscGI, MboII, BseRI, MlyI, BseYI, MmeI, BsiI, MnlI, BsmI, Pfl1108I, BsmAI, PleI, BsmFI, PpiI, Bsp24I, PsrI, BspGI, RleAI, BspMI, SapI, BspNCI, BsrI, or SimI.

The disclosed TAL effectors with nuclease function can be used to cleave DNA at a region of interest in cellular chromatin (e.g., at a desired or predetermined site in a genome, for example, in a gene, either mutant or wild-type). For such targeted DNA cleavage, TAL repeats are engineered to bind a target site at or near the predetermined cleavage site, and a fusion protein comprising the engineered TAL binding domain and a cleavage domain is expressed in a cell. Upon binding of the TAL repeat to the target site, the DNA is typically cleaved near the target site by the cleavage domain. For targeted cleavage using a TAL effector nuclease fusion protein, the binding site can encompass the cleavage site, or the near edge of the binding site can be 1, 2, 3, 4, 5, 6, 10, 25, 50 or more nucleotides (or any integral value between 1 and 50 nucleotides) from the cleavage site. The exact location of the binding site, with respect to the cleavage site, will depend upon the particular cleavage domain, and the length of any linker. Thus, the methods described herein can employ an engineered TAL effector nuclease fusion. In these cases, the TAL effector fusion is engineered to bind to a target sequence, at or near which cleavage is desired. Once introduced into a cell the TAL effector fusion binds to the target sequence and cleaves at or near the target sequence.

The exact site of cleavage depends on the nature of the cleavage domain and/or the presence and/or nature of linker sequences between the binding and cleavage domains. Optimal levels of cleavage can also depend on both the distance between the binding sites of the two fusion proteins (See, for example, Smith et al. (2000) Nucleic Acids Res. 28:3361-3369; Bibikova et al. (2001) Mol. Cell. Biol. 21:289-297) and the length of the linker in each fusion protein. In certain embodiments, the cleavage domain comprises two cleavage half-domains, both of which are part of a single polypeptide comprising a TAL cassette, a first cleavage half-domain and a second cleavage half-domain. The cleavage half-domains can have the same amino acid sequence or different amino acid sequences, so long as they function to cleave the DNA.

Further, the TAL repeats bind to target sequences which are typically disposed in such a way that, upon binding of the TAL effector fusion proteins, the two cleavage half-domains are presented in a spatial orientation to each other that allows reconstitution of a cleavage domain (e.g., by dimerization of the half-domains), thereby positioning the half-domains relative to each other to form a functional cleavage domain, resulting in cleavage of cellular chromatin in a region of interest. Generally, cleavage by the reconstituted cleavage domain occurs at a site located between the two target sequences.

The two fusion proteins can bind in the region of interest in the same or opposite polarity, and their binding sites (i.e., target sites) can be separated by any number of nucleotides, e.g., from 0 to 200 nucleotides or any integral value in between. In certain embodiments, the binding sites for two fusion proteins, each comprising a TAL effector and a cleavage half-domain, can be located between 5 and 18 nucleotides apart, for example, 5-8 nucleotides apart, or 15-18 nucleotides apart, or 6 nucleotides apart, or 16 nucleotides apart, as measured from the edge of each binding site nearest the other binding site, and cleavage occurs between the binding sites.
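The spacing rule in the preceding paragraph can be expressed as a small calculation. The sketch below assumes each binding site is given as a (start, end) pair in 0-based, end-exclusive coordinates on the same reference strand; the function names and the coordinate convention are illustrative assumptions, and the default 5 to 18 nucleotide window is simply the range named in the text.

```python
def spacer_length(site_a, site_b):
    """Number of nucleotides between the facing edges of two binding sites.

    Each site is a (start, end) tuple, 0-based and end-exclusive (assumed
    convention). The sites may be passed in either order.
    """
    (left_start, left_end), (right_start, right_end) = sorted([site_a, site_b])
    if right_start < left_end:
        raise ValueError("binding sites overlap")
    return right_start - left_end


def spacer_in_range(site_a, site_b, min_nt=5, max_nt=18):
    """True if the spacer falls within the window named in the text."""
    return min_nt <= spacer_length(site_a, site_b) <= max_nt


# Two hypothetical 19-nt binding sites separated by 16 nucleotides.
left_site = (100, 119)
right_site = (135, 154)
gap = spacer_length(left_site, right_site)
```

Measuring from the facing edges matches the phrase "as measured from the edge of each binding site nearest the other binding site"; a spacer of 0 corresponds to directly abutting sites.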

The site at which the DNA is cleaved generally lies between the binding sites for the two fusion proteins. Double-strand breakage of DNA often results from two single-strand breaks, or “nicks,” offset by 1, 2, 3, 4, 5, 6 or more nucleotides (for example, cleavage of double-stranded DNA by native FokI results from single-strand breaks offset by 4 nucleotides). Thus, cleavage does not necessarily occur at exactly opposite sites on each DNA strand. In addition, the structure of the fusion proteins and the distance between the target sites can influence whether cleavage occurs adjacent to a single nucleotide pair, or whether cleavage occurs at several sites. However, for many applications, including targeted recombination and targeted mutagenesis, cleavage within a range of nucleotides is generally sufficient, and cleavage between particular base pairs is not required.

TAL effector fusion(s) can be delivered to cells as polypeptides and/or polynucleotides as described elsewhere herein. For example, two polynucleotides, each comprising sequences encoding one of the aforementioned polypeptides, can be introduced into a cell. Alternatively, a single polynucleotide comprising sequences encoding both fusion polypeptides may be introduced into a cell, for example using one of the vectors shown in FIG. 22.

TAL activators. TAL effector fusions engineered, assembled or used by the methods or in compositions described herein may further relate to polypeptides or proteins with activator activity. Activation domains that may be fused to engineered TAL effectors are, for example, herpes simplex virus protein 16 (VP16) (Sadowski et al., “GAL4-VP16 is an unusually potent transcriptional activator”, Nature. 1988 Oct. 6; 335(6190):563-4.), the engineered VP64 activator containing four copies of the VP16 core motif (Beerli et al., “Toward controlling gene expression at will: specific regulation of the erbB-2/HER-2 promoter by using polydactyl zinc finger proteins constructed from modular building blocks.” Proc. Natl. Acad. Sci U.S.A. 1998 Dec. 8; 95(25):14628-33.), nuclear factor-κB subunit p65 (Liu et al., “Regulation of an endogenous locus using a panel of designed zinc finger proteins targeted to accessible chromatin regions. Activation of vascular endothelial growth factor A.” J. Biol. Chem. 2001 Apr. 6; 276(14):11323-34.), VP32, VP48, VP80 or other activation domains known in the art. Thus the invention relates, in part, to TAL effector activator fusion proteins and optimized sequences encoding such proteins. In particular, the invention includes TAL effectors with codon-optimized activator domains such as those encoded by SEQ ID NOs: 4 and 5 (see FIG. 6B). Successful activation of gene expression by TAL activator fusion proteins has been demonstrated by the inventors in the context of various reporter assay systems (FIGS. 15 and 18A) and for the endogenous sox2 gene in HeLa cells (FIG. 33B). In this experiment, the promoter region of the sox2 gene, which encodes a transcription factor for maintaining pluripotent stem cells, was targeted by TAL FLVP64 activator fusion proteins. Two different TAL binding domains were designed to bind to the 4643 and 655 sites in the Sox2 promoter region (FIG. 33A) and fused to the VP64 activation domain. HeLa cells were then transfected with empty vector (pcDNA3), either of the 4643 or 655 specific TAL VP64 activators or a mixture of both fusion proteins. The mRNA levels of Sox2 were evaluated by Taqman assay 72 hours post transfection and normalized to β-actin, and fold-induction of Sox2 mRNA was determined. As shown in FIG. 33B, Sox2 expression could be significantly increased by TAL VP64 targeted to the 655 site, whereas the expression boost was even more substantial when the 4643 site was bound. Further, a synergistic effect on Sox2 gene expression was observed when both TAL VP64 proteins were present in the transfected cells. These experiments illustrate the ability of designed TAL activator fusion proteins to efficiently activate endogenous gene expression.

TAL repressors. TAL effector fusions engineered, assembled or used by the methods or in compositions described herein may further relate to polypeptides or proteins with repressor activity. Repressor domains that may be fused to engineered TAL effectors are, for example, the Krüppel-associated box (KRAB), a transcriptional repression module responsible for the DNA binding-dependent gene silencing activity of hundreds of vertebrate zinc finger proteins (Margolin et al. “Krüppel-associated boxes are potent transcriptional repression domains.” Proc Natl Acad Sci USA. 1994 May 10; 91(10):4509-13.), mSin3 interaction domain SID (Ayer et al. “Mad proteins contain a dominant transcription repression domain.” Mol Cell Biol. 1996 October; 16(10):5772-81.), ERF repressor domain ERD (Sgouras et al. “ERF: an ETS domain protein with strong transcriptional repressor activity, can suppress ets-associated tumorigenesis and is regulated by phosphorylation during cell cycle and mitogenic stimulation.” EMBO J. 1995 Oct. 2; 14(19):4781-93.), histone methyltransferase HMT (Snowden et al. “Gene-specific targeting of H3K9 methylation is sufficient for initiating repression in vivo.” Curr Biol. 2002 Dec. 23; 12(24):2159-66.), Gfi-1 (growth factor independent 1 transcription repressor), repressor element 1 (RE1) silencing transcription factor REST or other repressor domains known in the art. Thus the invention relates in part to TAL effector repressor fusion proteins and optimized sequences encoding such proteins. In particular, the invention includes TAL effectors with codon-optimized repressor domains such as those encoded by SEQ ID NO: 6 (FIG. 6B).

Successful repression of gene expression by TAL repressor fusion proteins has been demonstrated by the inventors in the context of various reporter assay systems (FIGS. 15 and 16) and for the endogenous sox2 gene in HeLa cells (FIG. 33C). In this experiment, HeLa cells were transfected with either an empty vector (pcDNA3.1), a TAL MCS expression plasmid targeting the 4643 site in the promoter region of the sox2 gene in the absence of a repressor function or a TAL KRAB repressor fusion protein directed to the 4643 site. Relative mRNA levels of sox2 were evaluated by Taqman assay 72 hours post transfection and normalized to β-actin. As shown in FIG. 33C, a significant repression of Sox2 expression was achieved by the binding of the TAL effector to the 4643 site, whereas an even stronger downregulation was observed in the presence of the KRAB repressor fusion. This example demonstrates the functionality of TAL effectors and TAL repressor fusion proteins for endogenous gene knock-down applications.

Furthermore, in certain instances TAL effectors may be fused with other effector functions such as a methylase (e.g., DNA-MT), a demethylase (e.g., MBD2b), an acetylase (e.g., histone acetylase HAT) or a deacetylase (e.g., histone deacetylase HDAC). Thus the invention relates in part to TAL effectors with chromatin modifying function and optimized sequences encoding such proteins. In particular, the invention includes TAL effectors with codon-optimized sequences encoding methylase, demethylase, acetylase or deacetylase activities.

TAL epigenetic modifiers. In one aspect the invention relates to TAL epigenetic modifiers paired with transcriptional activators. Activation or up-regulation of endogenous genes is a key application for TAL effectors. As knowledge of eukaryotic cellular pathways increases, combinations of knock-out, down-regulation, and up-regulation of particular genes in a pathway will be key to modulating production of a specific product or the inducement of a particular phenotype in response to extracellular stimuli. Up-regulation of silenced genes poses a unique challenge since many genes are silenced by virtue of epigenetic modification such as methylation, acetylation and sequestration of the promoter region in heterochromatin. A solution to this problem is provided by a method where a TAL is fused with an epigenetic modifier (e.g., a deacetylase or a demethylase) and the modifier is combined with a specific activation domain (e.g., VP16 or VP64) in the same molecule.

Combination of these activities in one molecule would, e.g., allow demethylation of a methylated promoter region by the activity of the epigenetic modifier and subsequent activation of the promoter by the fused activator moiety in an efficient manner. Thus, the invention includes, in part, a TAL epigenetic modifier operationally linked with a transcriptional activator domain. In particular, the invention includes a TAL effector fusion protein composed of at least a TAL effector (i.e., one or more TAL cassettes flanked by N- and C-terminal domains) (BD) or a modified version thereof, a spacer sequence(s) of a defined length, an epigenetic modifier (EM) or a modified variant thereof, a specific activation domain (AD) or a modified variant thereof and a nuclear localization signal (NLS). In one aspect of the invention, the modified version of the TAL effector can be a truncated binding domain wherein either the N-terminus or the C-terminus or both termini have been truncated. In one aspect of the invention the epigenetic modifier can be a deacetylase, a demethylase, or a truncated or mutated variant thereof. In another aspect, the activation domain may be a natural or synthetic activation domain. For example, the activation domain may be VP16 or an array of two, three, four, five, six, seven or eight repeats of the VP16 minimal core motif as defined elsewhere herein. In one aspect, the activation domain may be VP32, VP48 or VP64 or VP80 or modified versions derived therefrom. The invention further includes different architectures of the TAL epigenetic modifiers wherein the fused moieties can be connected in different orders.

Thus, in one aspect the invention relates to a functional vector containing a nucleic acid sequence encoding a TAL epigenetic modifier, wherein the vector contains at least nucleic acid sequences encoding (i) a TAL effector (BD), (ii) an epigenetic modifier (EM), (iii) an activator domain (AD), (iv) a nuclear localization signal (NLS), and (v) one or more spacer sequence(s).

In certain instances one or more of the TAL cassettes in the TAL effector (BD) may contain RVDs specifically recognizing methylated sequences as described elsewhere herein. In one embodiment the RVD NG may be used to recognize mC where binding to C is to be excluded. In yet another embodiment the RVD N* may be used where binding to both mC and C is required.

The invention includes a vector, as described above, wherein the elements (i) to (v) are arranged in any of the following orders: 5′-BDsEMsADsNLS-3′ or 5′-BDsADsEMsNLS-3′ or 5′-EMsBDsADsNLS-3′ or 5′-EMsADsBDsNLS-3′ or 5′-ADsBDsEMsNLS-3′ or 5′-ADsEMsBDsNLS-3′ or 5′-NLSsBDsEMsAD-3′ or 5′-NLSsBDsADsEM-3′ or 5′-NLSsEMsBDsAD-3′ or 5′-NLSsEMsADsBD-3′ or 5′-NLSsADsBDsEM-3′ or 5′-NLSsADsEMsBD-3′ or 5′-BDsNLSsEMsAD-3′ or 5′-BDsNLSsADsEM-3′ or 5′-EMsNLSsBDsAD-3′ or 5′-EMsNLSsADsBD-3′ or 5′-ADsNLSsBDsEM-3′ or 5′-ADsNLSsEMsBD-3′ or 5′-BDsEMsNLSsAD-3′ or 5′-BDsADsNLSsEM-3′ or 5′-EMsBDsNLSsAD-3′ or 5′-EMsADsNLSsBD-3′ or 5′-ADsBDsNLSsEM-3′ or 5′-ADsEMsNLSsBD-3′.
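The orders listed above appear to be the 24 possible arrangements of the four domains BD, EM, AD and NLS, with adjacent domains joined by "s". The following sketch enumerates them, assuming that "s" denotes a spacer sequence (element (v)); the function name `architectures` is illustrative, and a plain ASCII apostrophe stands in for the prime symbol.

```python
from itertools import permutations

# Domain labels as used in the text: binding domain, epigenetic modifier,
# activation domain and nuclear localization signal.
DOMAINS = ("BD", "EM", "AD", "NLS")


def architectures(domains=DOMAINS, spacer="s"):
    """Return every 5'-to-3' domain order, with a spacer between neighbours."""
    return ["5'-" + spacer.join(order) + "-3'" for order in permutations(domains)]


all_orders = architectures()
```

Generating the list programmatically is one way to confirm that no arrangement is missing or duplicated when the number of fused domains changes.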

The invention further relates to a vector as described above wherein (i) the TAL cassettes and/or TAL repeats or (ii) the flanking N- and/or C-terminal domains are truncated or modified. For example, the TAL effector may contain 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 33, 35 or more cassettes. Further, the cassettes may differ in the number of amino acids they encode, and each may consist of 34, 35 or fewer than 34 amino acid residues. The N-terminal and/or C-terminal domains may be truncated by 10, 25, 50, 65, 75, 90, 110, 115, 148, 152, 161, 175, 182, 195, 202 or more amino acid residues. In another aspect of the invention, at least one of the domains in the above described vector may be linked with a sequence encoding a tag, such as a purification or detection tag as disclosed elsewhere herein. The vector as described above may be a GATEWAY® vector or a topoisomerase-adapted vector or a vector as described elsewhere herein. Furthermore, the invention also relates, in part, to a host cell transformed or transfected with the above described vector and a fusion protein expressed from the above described vector.

Nuclear Localization Signal. DNA binding molecules, such as TAL effectors and TAL effector fusions, may be designed for optimal function in different species or may be directed to different compartments within a cell. In some instances, it may be required to target a DNA binding molecule to a cell nucleus. This can be achieved, for example, by using a nuclear localization signal (NLS). The C-terminal domain of wild-type TALs usually harbors an NLS to efficiently target the TAL to the nucleus of plant cells. However, when constructing new or modified DNA binding molecules it may be desirable to use a species-specific or engineered NLS. For example, if a truncated TAL repeat domain is to be used lacking parts of the C-terminal domain that would naturally harbor the NLS, a modified or heterologous NLS may be incorporated in the truncated molecule. It may also be required to change the location of an NLS within a protein to achieve optimal accessibility and/or activity. Different NLS are known in the art and may either be species-specific or compatible with several species. Typically, an NLS consists of one or more short sequences of positively charged lysines or arginines exposed on the protein surface. Different nuclear localized proteins may share the same NLS. In some of the TAL effectors described herein, the original NLS found in the Hax3 TAL effector is included. However, in certain embodiments of the invention the natural TAL NLS may be replaced by a heterologous NLS to optimize efficiency of nuclear import. A classical NLS suitable for practicing the invention may, e.g., include SV40 T-Antigen monopartite NLS, C-myc monopartite NLS or nucleoplasmin bipartite NLS, or modified or evolved versions thereof recognized by importin-α, but may also include non-classical NLS known in the art, most of which are recognized directly by specific receptors of the importin β family without the intervention of an importin α-like protein. In one aspect of the invention truncated TAL vectors in which the original Hax3 NLSs have been removed may be equipped with a slightly modified SV40 NLS sequence for efficient nuclear targeting. The core motif of the SV40 NLS is typically PKKKRKV (SEQ ID NO: 107) or PKKKRKVE (SEQ ID NO: 108). In a first variant, two glycines were added to the NLS on either side to provide a flexible linker to increase accessibility of the NLS if located at different positions between the fused domains. Furthermore, an aspartate residue was inserted right after the core motif to increase activity, yielding sequence GGMAPKKKRKVDGG (SEQ ID NO: 28). In another variant a glycine-serine linker was attached upstream of the core motif, yielding sequence QGSPKKKRKVDAPP (SEQ ID NO: 29). Other variations can be introduced on either side of the core motifs to increase activity or accessibility within the folded protein. Furthermore, an NLS sequence can be located at different positions within a TAL effector such as, e.g., N-terminal or C-terminal of the repeat domain.
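The relationship between the SV40 core motif and the two NLS variants given above can be made explicit by splitting each variant around the core. This is a minimal sketch; the function name `split_around_core` is an illustrative assumption, and the sequences are copied from the paragraph above.

```python
# Core motif and variants as stated in the text.
SV40_CORE = "PKKKRKV"                      # SEQ ID NO: 107
NLS_VARIANTS = {
    "SEQ ID NO: 28": "GGMAPKKKRKVDGG",
    "SEQ ID NO: 29": "QGSPKKKRKVDAPP",
}


def split_around_core(sequence, core=SV40_CORE):
    """Return (upstream flank, core, downstream flank), or None if the core
    motif does not occur in the sequence."""
    index = sequence.find(core)
    if index < 0:
        return None
    return sequence[:index], core, sequence[index + len(core):]


flanks = {name: split_around_core(seq) for name, seq in NLS_VARIANTS.items()}
```

In both variants the residue immediately downstream of the core is the inserted aspartate, so each contains the extended motif PKKKRKVD.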

Thus, the invention relates in part to TAL effectors or TAL epigenetic modifiers containing a heterologous NLS different from the original NLS of the TAL protein. Furthermore, the invention relates to TAL effectors or TAL epigenetic modifiers containing a heterologous NLS different from the original NLS of the TAL protein, wherein the TAL domain is a truncated TAL domain. Furthermore, the invention relates to TAL effectors or TAL epigenetic modifiers containing a heterologous NLS different from the original NLS of the TAL protein, wherein the TAL domain is a truncated TAL domain and the NLS is located at the N-terminus of the TAL domain or at the C-terminus of the TAL domain or between the TAL domain and the effector domain or between the TAL domain and the epigenetic modifier or between the epigenetic modifier and the activator domain. In some embodiments, the invention relates to a truncated TAL effector comprising an NLS with core motif PKKKRKVD (SEQ ID NO: 109), wherein at least one side of the core motif is flanked by a flexible linker sequence.

Organelle targeting of TAL effectors. Nuclear localization signals as described elsewhere herein allow for targeting of TAL effector fusions to nuclei of various host cells. However, in some instances it may be required to target TAL effectors to organelles other than the nucleus, such as, e.g., mitochondria or plant chloroplasts. Typical targeting signals that may direct polypeptides to these organelles are listed in the following TABLE 12:

TABLE 12

Organelle | Typical signal location within polypeptide | Nature of signal
Mitochondrion | N-terminal | 3 to 5 nonconsecutive Arg or Lys residues, often with Ser and Thr; no Glu or Asp
Chloroplast | N-terminal | Generally rich in Ser, Thr, and small hydrophobic residues; typically poor in Glu and Asp
Nucleus | Internal | Core motif of 5 basic residues, or two smaller clusters of basic residues separated by approx. 10 amino acids

Specific targeting can be triggered by various mechanisms. In a first, posttranslational mechanism, the import machinery of an organelle selects proteins by recognition of a transit peptide or specific localization signals. In a second, co-translational mechanism, the signal sequence in the nascent polypeptide binds signal recognition particles (SRPs), which represses further translation and targets the entire RNA-ribosome-nascent polypeptide complex to the endoplasmic reticulum, where protein translation resumes. In a third, mRNA-based mechanism, the untranslated mRNA is localized by an RNA-binding protein associated with a molecular motor or the target membrane, and translation is initiated after mRNA localization. All three mechanisms have been shown to contribute to organelle targeting in both mammals and algae such as, e.g., Chlamydomonas (Uniacke and Zerges, “Chloroplast protein targeting involves localized translation in Chlamydomonas” Proc. Natl. Acad. Sci. USA 2009, vol. 106 no. 5, p. 1439-1444).

In certain instances, it may be required to target a TAL effector or TAL effector fusion to the genome of mitochondria or chloroplasts of plants for genomic engineering. In some instances, it may be required to target TAL effectors to the chloroplast genome of algae or microalgae. A TAL effector that is to be targeted to the chloroplast genome may, e.g., be expressed in the algae nucleus, translated in the cytosol and then directed to the chloroplast lumen via a signal sequence in the TAL effector fusion protein. In some aspects, the invention relates to a TAL effector or TAL effector fusion harboring a signal sequence which allows for chloroplast targeting. In one embodiment the signal sequence is located in the N-terminal domain of the TAL effector or TAL effector fusion. In some aspects, the signal sequence may be multipartite. For example, the signal sequence may be accompanied by an ER retention signal in cases where the first step in targeting polypeptides to plastids requires passage of that polypeptide into the ER. In some aspects, at least one part of the signal sequence may be rich in serine and/or threonine residues. Furthermore, at least one part of the signal sequence may comprise small hydrophobic residues. In addition, at least one part of the signal sequence may be poor in glutamate and/or aspartate residues. In one embodiment the signal sequence may encode the amino acid motif ASAFAP (SEQ ID NO: 110). The signal sequence may be derived from a natural signal sequence or may be an artificial sequence or composed of several signal sequences. In some embodiments, the signal sequence may contain between 4 and 10, between 8 and 20, between 15 and 40, or between 20 and 50 amino acid residues.

For expression of a TAL effector or TAL effector fusion protein in algae, the TAL effector coding sequence containing a chloroplast targeting signal may, e.g., be cloned in a suitable expression vector for microalgae such as, e.g., the pChlamy 1 Vector which is part of the “GENEART® Chlamydomonas Engineering Kits” offered by Life Technologies Corp. (Carlsbad, Calif.). In some embodiments, an algae expression vector should harbor one or more of the following features: an algae promoter (e.g., hybrid Hsp70A-RbcS2 promoter), an untranslated region for increased mRNA stability (e.g., Cop1 3′-UTR), a versatile multiple cloning site for simplified cloning of the TAL effector or TAL effector fusion, a resistance gene (e.g., aph7 gene driven by the β2-tubulin promoter for hygromycin selection), an E. coli selection marker (e.g., ampicillin, kanamycin etc.) and an origin for maintenance (e.g., pUC ori). To achieve stable expression of the TAL effector or TAL effector fusion, the expression vector is then transformed into algae by methods known in the art (e.g., electroporation). Following random integration into the algae nuclear genome, the TAL effector or TAL effector fusion will be expressed in the cytosol and will be delivered to the chloroplast by means of the signal sequence.

Alternatively, TAL effector function can be delivered to chloroplasts by direct expression of TAL effector sequences in the chloroplast compartment. For direct expression in chloroplasts the TAL effector encoding sequence and/or sequences encoding fused domains may be codon optimized. In one embodiment genes to be expressed in chloroplasts are codon optimized by preferentially using codons containing adenine or uracil nucleotides in the third position. Since heterologous proteins expressed in algae may be subject to protease degradation, e.g., by ATP-dependent proteases, the TAL effector design may further include elimination of potential protease cleavage sites. The TAL effector or TAL effector fusion coding sequence is then cloned into a suitable expression vector that carries additional elements required for expression by the chloroplast machinery such as, e.g., a suitable promoter, 5′ and 3′ UTRs and a marker gene for selection of transformed cells. For example, a suitable expression vector may contain at least a chloroplast promoter (e.g., psbA or atpA promoter) and a chloroplast terminator (e.g., rbcL terminator) for driving expression of the TAL effector, a marker gene (e.g., bacterial gene aadA conferring spectinomycin and streptomycin resistance) and flanking sequences (e.g., from the psbA gene) for homologous integration into the chloroplast DNA. The expression vector may be delivered to the chloroplast by a plastid transformation procedure known as biolistics using gold carrier particles from Seashell Technology (La Jolla, Calif.) or by other methods known in the art (see, e.g., Radakovits et al. “Genetic Engineering of Algae for Enhanced Biofuel Production”; Eukaryotic Cell 2010, 9(4):486.). The TAL effector sequence will then be inserted into the chloroplast genome by homologous recombination mediated by the flanking sequences.
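The codon choice described above (a preference for codons with adenine or uracil in the third position) can be sketched as a simple back-translation. The example below is an illustration only: it assumes the standard genetic code, writes the coding strand as DNA (T in place of U), and picks the first synonymous codon ending in A or T rather than weighting by a measured chloroplast codon usage table; the function names are hypothetical.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering (assumption: the
# standard code is adequate for this illustration).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def preferred_codon(amino_acid):
    """Return a codon for the amino acid, preferring A or T in position 3."""
    codons = [c for c, aa in CODON_TABLE.items() if aa == amino_acid]
    if not codons:
        raise ValueError(f"unknown amino acid: {amino_acid}")
    at_ending = [c for c in codons if c[2] in "AT"]
    # Met (ATG) and Trp (TGG) have no A/T-ending codon, so fall back.
    return (at_ending or codons)[0]


def back_translate(protein):
    """Back-translate a protein sequence using the A/T third-position rule."""
    return "".join(preferred_codon(aa) for aa in protein)


def translate(dna):
    """Translate a coding sequence whose length is a multiple of three."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


# The chloroplast signal motif mentioned in the text (SEQ ID NO: 110).
motif_dna = back_translate("ASAFAP")
```

A production design would normally also screen the result for unwanted restriction sites, repeats and the protease cleavage sites mentioned above; none of that is attempted here.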

Fluorescent and Other Detectably Tagged TAL proteins. TAL effectors ofthe invention can be fused to various functional effector molecules asdescribed above to fulfill specific tasks when delivered to a given hostcell. In certain instances it may be desired to either determine (i)where a TAL effector is located in a cell or (ii) how much of a TALeffector is present. Such tracking may help to ensure that a customizedTAL effector is delivered to the predicted place of action at sufficientamounts to fulfill its function. For this purpose TAL effectors can belabeled with a fluorescent or other detectable (e.g., luminescent)portion which allows detection of the TAL effector, e.g., by in vivoimaging. Any fluorescent or other detectable portion or protein known inthe art can be used to tag a TAL protein. In a first aspect, afluorescent moiety may be attached to a TAL effector protein e.g. byproviding a fluorescently labeled antibody specifically binding to a TALor its fused effector function. In another aspect, a fluorescent orother detectable moiety can be directly fused at the amino- orcarboxyl-terminal ends of a TAL effector or a TAL effector fusionprotein. The location of the fluorescent or other detectable moietywithin the fusion protein will mainly depend on the provided effectorfunction and the folding requirements of the fused domains. In aspecific embodiment, a gene encoding a fluorescent or other detectableprotein may be inserted in a TAL effector expression vector so that aTAL fluorescent/detectable fusion protein will be expressed followingdelivery of such expression vector to a target host cell. 
Any fluorescent or other detectable protein suitable for in vivo tracking may be used for that purpose including but not limited to green fluorescent protein (GFP) or enhanced green fluorescent protein (EGFP), red fluorescent protein (RFP), blue fluorescent protein (BFP), cyan fluorescent protein (CFP), yellow fluorescent protein (YFP), violet-excitable green fluorescent protein (Sapphire) or luciferase. A sequence encoding a fluorescent or other detectable protein may be inserted upstream or downstream of a TAL repeat region or upstream or downstream of the effector coding region depending on functional and folding requirements of the provided domains. The gene sequence encoding the fluorescent or other detectable protein may be a wild-type or a codon-optimized synthetic sequence as described in more detail elsewhere herein. Such fluorescent or other detectable tag may be fused to any TAL effector function described herein including a separate TAL domain, a TAL nuclease or nuclease cleavage half domain, a TAL activator, a TAL repressor, a TAL epigenetic modifier, a TAL polymerase, a TAL scaffold etc. For example, each TAL nuclease cleavage half domain of a TAL nuclease pair may be fused to the same or a different fluorescent or other detectable protein. Use of a different fluorescent protein for each TAL nuclease cleavage half domain may be used to determine whether expression and localization of both domains within a cell is equally balanced. TAL fluorescent or other detectable protein fusions may be used to help to better understand TAL effector function and activity in vivo and may serve to improve and optimize TAL effector design for various applications including the methods and applications described herein.

Methods, Vectors and Kits for Assembly of TAL Effectors

Assembly of DNA binding effector molecules and TAL effectors and customized Toolkits. With the advent of the synthetic biology era, homologous recombination has become combined with multiple nucleic acid assembly technologies. Currently, commercially available assembly kits allow piecing together PCR-amplified or pre-cloned DNA fragments in vivo or in vitro in a single step in a pre-determined and seamless manner. Although these approaches work efficiently with up to 10 fragments that share common ends (and in some cases with fragments without end-terminal homology), many are not robust enough to be used in complex DNA shuffling cloning. Thus, there is a need for novel DNA shuffling assembly strategies and methods and kits based thereon to allow for efficient high throughput assembly and cloning of customized DNA binding effector molecules, such as TAL effectors.

A rapid subcloning nucleic acid transfer strategy which allows for the transfer of nucleic acid segments from one vector into another vector by type IIS assembly has been proposed and is referred to as “Golden Gate” cloning (Engler, C., R. Kandzia, and S. Marillonnet. 2008. A one pot, one step, precision cloning method with high throughput capability. PLoS One 3:e3647.; Kotera, I., and T. Nagai. 2008. A high-throughput and single-tube recombination of crude PCR products using a DNA polymerase inhibitor and type IIS restriction enzyme. J Biotechnol 137:1-7.; Weber, E., R. Gruetzner, S. Werner, C. Engler, and S. Marillonnet. 2011. Assembly of Designer TAL Effectors by Golden Gate Cloning. PLoS One 6:e19722.). The principles of this type IIS assembly strategy are based on the ability of type IIS restriction enzymes to cut outside of their recognition site. Two or more DNA fragments can be designed to be flanked by a type IIS restriction site such that digestion of the fragments removes the recognition sites of the type IIS enzymes and generates ends with complementary three or four nucleotide overhangs that can be ligated seamlessly, generating a junction that lacks the recognition sites. A DNA shuffling approach based upon type IIS assembly also has been proposed (Engler, C., R. Gruetzner, R. Kandzia, and S. Marillonnet, 2009. Golden gate shuffling: a one-pot DNA shuffling method based on type IIS restriction enzymes. PLoS One 4:e5553.; Engler, C., and S. Marillonnet. 2011. Generation of families of construct variants using golden gate shuffling. Methods Mol Biol 729:167-81.). The strategy, which permits the generation of libraries of recombinant genes by combining in one reaction several fragment sets prepared from different parental templates, is also useful for building highly repetitive nucleic acid molecules, such as, e.g., TAL effectors (Weber, E., R. Gruetzner, S. Werner, C. Engler, and S. Marillonnet. 2011. Assembly of Designer TAL Effectors by Golden Gate Cloning. PLoS One 6:e19722.).
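The principle of type IIS assembly outlined above can be sketched in a few lines. The example assumes BsaI as the enzyme (recognition site GGTCTC, cutting one nucleotide downstream on the top strand and five on the bottom strand, which leaves a four nucleotide 5′ overhang) and models only the top strand; the function names and the test sequences are hypothetical and serve illustration only.

```python
SITE = "GGTCTC"      # BsaI recognition site on the top strand
SITE_RC = "GAGACC"   # the same site in reverse orientation
OVERHANG = 4         # BsaI leaves a 4-nt 5' overhang


def release(seq):
    """Return the top-strand span released by cutting at both flanking sites.

    The first and last four nucleotides of the returned span correspond to
    the single-stranded overhangs; the recognition sites are left behind.
    """
    fs = seq.index(SITE)       # forward site upstream of the insert
    rs = seq.rindex(SITE_RC)   # inverted site downstream of the insert
    return seq[fs + len(SITE) + 1: rs - 1]


def ligate(fragments):
    """Join released fragments whose adjacent overhangs match."""
    joined = fragments[0]
    for frag in fragments[1:]:
        if joined[-OVERHANG:] != frag[:OVERHANG]:
            raise ValueError("overhangs are not compatible")
        joined += frag[OVERHANG:]
    return joined


# Two hypothetical fragments sharing the internal overhang GCTT.
frag_a = SITE + "A" + "AATG" + "CCCGGG" + "GCTT" + "T" + SITE_RC
frag_b = SITE + "T" + "GCTT" + "AAATTT" + "CGCA" + "A" + SITE_RC
assembled = ligate([release(frag_a), release(frag_b)])
```

In this sketch `assembled` contains neither GGTCTC nor GAGACC, which mirrors the statement that the ligated junction lacks the recognition sites.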

Different strategies have been described in the literature to assemble TAL effectors starting with monomeric building blocks (cassettes). One method relies on PCR amplification of the starting material (e.g., TAL cassettes) to attach type IIS cleavage site containing adapter sequences providing the required individual overhangs (Zhang, F. et al. Efficient construction of sequence-specific TAL effectors for modulating mammalian transcription. Nat. Biotechnol. 29, 149-153 (2011)). This method involves several rounds of PCR and ligation to assemble individual cassettes into 12 cassette TAL effectors. One disadvantage of this method is that it is labor intensive and, thus, is not well suited for high throughput applications. Other approaches tried to avoid intermediate PCR steps to limit upfront work by assembling many (up to ten or twenty) cassettes simultaneously into a given target vector (Morbitzer, R., Elsaesser, J., Hausner, J. & Lahaye, T. Assembly of custom TALE-type DNA binding domains by modular cloning. Nucleic Acids Res. 39:5790-5799 (2011)). Although the successful parallel insertion of up to 10 fragments by type IIS assembly has been reported, the efficiency and reliability of this method decrease with increasing numbers of individual fragments. Thus, such method may not be optimal in many high-throughput settings. By making use of type IIS assembly, the inventors have developed an efficient assembly strategy for TAL nucleic acid binding cassettes starting from a library of TAL cassette trimers randomly assembled from monomer building blocks which is suitable for the construction in a manufacturing setting (FIG. 7A). 
The construction of a large trimer library (i.e., a collection of individual constructs each carrying a characterized triplet of TAL nucleic acid binding cassettes) allows for convenient high throughput TAL assembly processes. This is so because once a trimer library is generated, only a few steps need be performed with limited amounts of larger parts to generate various TAL effectors with different numbers of cassettes and recognizing different nucleotide sequences. By using the trimer library as starting material for all higher order assembly steps, two sets of only 3 or 4 trimer TAL cassettes are required to assemble TAL effectors with 18 or 24 cassettes, respectively. The parallel assembly of only 3 or 4 DNA fragments is a very reliable process with a high probability of picking a correct clone which avoids tedious screening procedures and repetition of experiments.
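The bookkeeping behind the trimer approach can be sketched as follows. The sketch assumes the commonly reported RVD assignments NI→A, HD→C, NG→T and NK→G, treats every one of the 18 or 24 positions as covered by a trimer (the half-repeat and any position-specific overhangs are not modeled), and uses hypothetical function names.

```python
from itertools import product

# Assumed RVD assignment for the four cassette categories.
RVD_FOR_BASE = {"A": "NI", "C": "HD", "G": "NK", "T": "NG"}

# All possible combinations of three cassettes: 4 ** 3 = 64 trimers.
TRIMER_LIBRARY = ["".join(t) for t in product("ACGT", repeat=3)]


def plan_trimers(target):
    """Split a target site into two sets of trimers, one per capture vector."""
    n = len(target)
    if n % 6 != 0:
        raise ValueError("target length must be a multiple of 6 (e.g., 18 or 24)")
    trimers = [target[i:i + 3] for i in range(0, n, 3)]
    half = len(trimers) // 2
    return trimers[:half], trimers[half:]


def to_rvds(trimer):
    """Translate a nucleotide trimer into the RVDs of its three cassettes."""
    return tuple(RVD_FOR_BASE[base] for base in trimer)
```

For an 18 nucleotide target such as `ACGTACGTACGTACGTAC`, `plan_trimers` returns two sets of three trimers, so each step (i) reaction combines only three fragments with a capture vector.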

Furthermore, the trimer library is based on an innovative design of the underlying TAL cassettes. A TAL cassette library usually contains at least four different categories of cassettes (e.g., NI, NK, HD, NG etc.) wherein all cassettes of one category bind a specific nucleotide (either A or G or C or T) (see FIG. 7A). One of the four cassettes or an additional cassette may be designed to bind either both mC and C or specifically bind mC only as described elsewhere herein. In addition, each category of cassettes may contain at least one shorter cassette encoding a so-called half-repeat. In some embodiments, the cassettes are designed such that one or more cassettes of a category can be recycled, i.e., they can be allocated to different positions of a TAL effector thereby reducing the total amount of cassettes that have to be synthesized. For example, a TAL effector with an array of 17.5 or 23.5 repeats can be assembled from a library that contains fewer than 24 different cassettes per category (e.g., 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11 or 10). In one example, a library of cassettes that reflects categories NI, NK, HD and NG contains as few as 11 different cassettes per category resulting in a total of 44 (11×4) cassettes required to assemble all possible combinations of 17.5 or 23.5 repeats of a TAL effector with 4 cassettes (one per category) representing the half-repeats at positions 18 or 24 and 40 cassettes (10 per category) representing all other positions. Those cassettes encoding the 17.5 and 23.5 half-repeats, as well as other half repeats, may provide a 3′ overhang that allows only direct assembly with a compatible overhang of a capture vector but with no other cassette of the library. In some instances the half repeat encoding cassette may already be part of a capture vector.

Thus, in one aspect the invention relates to a library of cassettes for assembly of a TAL effector with between 6 and 25 cassette positions (e.g., 18 or 24 positions), wherein the library of cassettes contains at least four different categories of cassettes with all cassettes of one category binding a specific nucleotide and wherein each cassette can be allocated to one or more distinct positions of the between 6 and 25 cassette positions (e.g., A₁-A₂₅, G₁-G₂₅, C₁-C₂₅, T₁-T₂₅), and wherein the one or more distinct positions are determined by complementary overhangs between cassettes.

Furthermore, the invention relates to embodiments of the above library of cassettes, wherein the complementary overhangs are generated by type IIS cleavage and/or at least one cassette in each category encodes a half repeat and/or wherein at least 2, 3, 4, 5, 6, 7, 8 or 9 cassettes of each category can be allocated to more than one distinct position of the between 6 and 24 cassette positions. The invention further relates to the use of the above library for assembly of a TAL effector or TAL effector fusion. In certain instances the TAL effector may comprise 18 cassette positions. In other instances the TAL effector may comprise 24 cassette positions. Furthermore, the invention includes use of the above library for the assembly of a trimer library. In one embodiment, the nucleotide overhangs between different cassettes are designed to reflect an optimal sequence diversity. In some instances this may be a maximum sequence diversity. This can, e.g., be achieved by shifting the borders between cassettes by one or more nucleotides which may result in cassettes of different lengths. For example, to generate an optimal overlap sequence between cassette A and cassette B, the border between both cassettes is shifted by one or more nucleotides in either direction resulting in one of the two cassettes having a shorter nucleotide sequence and the other one a longer nucleotide sequence. Thus, the invention further relates to a library of cassettes for assembly of TAL effectors wherein the library comprises standard and non-standard cassettes and wherein a standard cassette contains n×3 nucleotides encoding n residues, wherein n is a number between 10 and 35, and wherein a non-standard cassette contains n×3−x or n×3+x nucleotides, wherein x is a number between 1 and 7, between 5 and 10, between 8 and 15, between 12 and 30 or between 30 and 50. 
For example, a standard cassette may consist of 34 residues encoded by 34×3=102 nucleotides whereas a non-standard cassette may consist of either 34×3−x (=less than 102 nucleotides) or 34×3+x (=more than 102 nucleotides). This strategy allows for generation of overhangs between cassettes with maximum diversity in nucleotide composition to improve efficiency of the assembly step.
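How shifting a border changes cassette lengths can be illustrated with a simple greedy heuristic. This is only a stand-in for the optimization described above, not the inventors' procedure: it assumes four nucleotide overhangs, tries shifts in the order 0, −1, +1, −2, +2 and so on, and accepts the first overhang that is neither palindromic nor already in use (directly or as reverse complement). All names and the toy sequence are hypothetical.

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq[::-1].translate(str.maketrans("ACGT", "TGCA"))


def choose_borders(seq, nominal, max_shift=3, k=4):
    """Pick a border near each nominal position so that all overhangs differ.

    A border at position p means the k-nt overhang is seq[p:p + k].
    """
    used, chosen = set(), []
    for pos in nominal:
        for shift in sorted(range(-max_shift, max_shift + 1), key=abs):
            start = pos + shift
            overhang = seq[start:start + k]
            rc = revcomp(overhang)
            if overhang != rc and overhang not in used and rc not in used:
                used.add(overhang)
                chosen.append(start)
                break
        else:
            raise ValueError(f"no usable overhang near position {pos}")
    return chosen


def cassette_lengths(seq, borders):
    """Lengths of the cassettes delimited by the chosen borders."""
    edges = [0, *borders, len(seq)]
    return [b - a for a, b in zip(edges, edges[1:])]


# Toy array of four identical 12-nt "repeats" with nominal borders every 12 nt.
unit = "CTGACCCCAGAC"
array = unit * 4
borders = choose_borders(array, [12, 24, 36])
```

With identical repeats the nominal borders would all yield the same overhang, so the second and third borders are shifted; the resulting cassettes have unequal lengths while their total length is unchanged, analogous to the n×3−x and n×3+x cassettes described above.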

Furthermore, the invention relates to a two-step method for assembling a functional vector comprising a TAL effector composed of n TAL cassettes wherein the method starts with a library of TAL cassette trimers comprising all possible combinations of three cassettes wherein each cassette is capable of specifically binding one nucleotide, said method being characterized by the following steps: (i) in a first step performing a first reaction wherein cassettes 1 to n/2 are concurrently cloned in a first capture vector using n/6 trimers, performing at least a second reaction wherein cassettes (n/2+1) to n are concurrently cloned in a second capture vector using n/6 trimers, wherein in the first capture vector cassettes 1 to n/2 are flanked by a first and a second type IIS cleavage site and in the second capture vector cassettes (n/2+1) to n are flanked by a second and a third cleavage site, and wherein the first, second, and third cleavage sites provide different overhangs when cleaved with one or more restriction enzymes and (ii) performing a third reaction wherein at least cassettes 1 to n/2 and cassettes (n/2+1) to n are released from the at least first and second capture vector in the presence of one or more, preferably the same type IIS restriction enzyme and are cloned in directed order via compatible ends of the first, second and third cleavage sites into a functional vector that provides overhangs compatible with the first and the third cleavage site (FIG. 7B). In one embodiment, the functional vector may be provided in a linearized form. In another embodiment, the functional vector may be provided in a closed circular form and may be cleaved together with the at least first and second capture vector in the same reaction. In yet another embodiment, the at least first and second capture vector and the functional vector are cleaved by the same type IIS restriction enzyme.
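The directed order in step (ii) follows from the chain of compatible overhangs (first, second and third cleavage site). A minimal sketch of that logic is given below; the overhang sequences and names are hypothetical placeholders, and each released repeat subset is represented only by the pair of overhangs at its ends.

```python
def assembly_order(vector_ends, inserts):
    """Resolve the order in which released inserts join into the vector.

    vector_ends: (left, right) overhangs offered by the functional vector.
    inserts: mapping of insert name -> (left, right) overhangs.
    """
    by_left = {}
    for name, (left, _right) in inserts.items():
        if left in by_left:
            raise ValueError(f"overhang {left} is used by more than one insert")
        by_left[left] = name

    order, current = [], vector_ends[0]
    while current != vector_ends[1]:
        name = by_left.pop(current, None)
        if name is None:
            raise ValueError(f"no insert starts with overhang {current}")
        order.append(name)
        current = inserts[name][1]
    if by_left:
        raise ValueError("some inserts cannot be placed")
    return order


# Hypothetical overhangs produced at the first, second and third cleavage site.
SITE1, SITE2, SITE3 = "TGAC", "CAGG", "GCTT"
released = {
    "cassettes (n/2+1) to n": (SITE2, SITE3),  # from the second capture vector
    "cassettes 1 to n/2": (SITE1, SITE2),      # from the first capture vector
}
order = assembly_order((SITE1, SITE3), released)
```

Here `order` lists the subset from the first capture vector before the subset from the second, independent of the order in which the inserts are supplied, because only one chain of overhangs connects the two ends of the functional vector.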

The two-step method may further be characterized in that the at least two reactions of step (i) are performed in parallel. Furthermore, in some embodiments, no PCR step is involved in either of steps (i) or (ii). The assembly reaction in step (i) and/or step (ii) may be performed in the presence of a ligase such as, e.g., a T4 or Taq ligase. In some instances, at least one overhang in the reactions in step (i) and/or step (ii) may be generated by one of the following restriction enzymes: BbsI, BsmBI, BsaI, AarI, BtgZI, or SapI. In many instances at least one of the first capture vector, the second capture vector and/or the functional vector contains a counter selectable marker gene. In one embodiment the counter selectable marker gene may be a toxin gene such as, e.g., ccdB or tse2.

Example 3 describes various embodiments of the two-step assembly method outlined above. In certain instances it may be required to sequence-verify intermediate and/or final assembly products to ensure sequence correctness and functionality in downstream applications such as, e.g., expression experiments.

A first protocol that may be used to produce functional TAL effector fusions based on the two-step assembly method therefore involves a first sequence evaluation of TAL repeat subsets in capture vectors obtained from assembly step (i) to allow selection of correct sequences for subsequent assembly step (ii), followed by a second sequence evaluation of the final TAL effector fusion and a final plasmid preparation for downstream applications. A standard lab workflow for the two-step assembly of a TAL effector fusion according to such first protocol may therefore be characterized by the following steps: Day 1: step (i) assembly of TAL repeat subsets in capture vectors followed by transformation of the reactions into chemically or electro-competent bacteria (such as, e.g., E. coli) via heat shock- or electroporation-based methods, respectively, and plating on selective media; Day 2: colony PCR (“cPCR”) for quick identification of clones carrying capture vectors with assembled TAL repeat subsets of correct length followed by inoculation of selective media cultures with selected cfu (“colony-forming units”); Day 3: plasmid preparation from (typically overnight) cultures and sequencing of TAL repeat subsets or parts thereof; Day 4: step (ii) assembly of TAL effector fusions from sequence-verified TAL repeat subsets followed by transformation into competent bacteria and plating on selective media; Day 5: cPCR to identify clones carrying assembled TAL effector fusions followed by inoculation of selective media culture(s) with selected cfu; Day 6: plasmid preparation from (typically overnight) culture(s) and sequencing of TAL effector fusions or parts thereof. 
The skilled person understands that cPCR can be replaced by other screening protocols known in the art (e.g., by growing each colony in selective media culture, subsequent plasmid preparation, digestion of the plasmid with restriction enzyme(s) that excises the insert, followed by separation by agarose gel electrophoresis) to identify positive clones. Further information on related cloning techniques and underlying protocols can be obtained, e.g., from Russell D W, Sambrook J (2001). Molecular cloning: a laboratory manual. Cold Spring Harbor, N.Y.: Cold Spring Harbor Laboratory. Thus, starting with trimers selected from a trimer library the two-step assembly method resulting in μg-amounts of sequence-verified TAL effector fusion plasmid may be performed within six days if sequence evaluation is performed after each assembly step (see TABLE 13 first protocol). A detailed description of an embodiment according to such first protocol is provided in Example 3a.

High Speed TAL assembly. In certain instances it may be desirable to reduce production time for customized TAL effector fusions thereby achieving shorter delivery times which may be of particular interest for customers ordering TAL-related services from a service provider as described elsewhere herein. Furthermore, a higher automation level of assembly processes can be achieved by reduction of required method steps resulting in less hands-on time. Thus, to minimize production time for customized TAL effector fusions, the inventors have further optimized the assembly procedure resulting in a second protocol. A lab workflow for the two-step assembly of a TAL effector fusion according to such second protocol may therefore be characterized by the following steps: Day 1: step (i) assembly of TAL repeat subsets in capture vectors followed by transformation of the reactions into competent bacteria (as outlined in the first protocol), and subsequent growth of pooled transformants in selective media cultures; Day 2: plasmid preparation from (typically overnight) cultures followed by step (ii) assembly of TAL effector fusions from purified pools of capture vectors containing non sequence-verified TAL repeat subsets followed by transformation into competent bacteria and plating on selective media; Day 3: cPCR to identify clones carrying assembled TAL effector fusions of correct length followed by inoculation of selective media culture(s) with selected cfu; Day 4: plasmid preparation from (typically overnight) culture(s) and sequencing of TAL effector fusions or parts thereof. 
The workflow according to such second protocol is summarized in TABLE 13. The findings that μg-amounts of correctly assembled TAL effector fusions can be obtained within four days using the two-step assembly workflow according to the second protocol indicate that step (i) assembly of intermediate TAL repeat subsets is particularly efficient allowing subsequent processing via step (ii) assembly without prior screening and pre-selection of correctly assembled capture vectors. Thus, whereas step (ii) assembly according to the first protocol is performed using pre-selected capture vectors with assembled TAL repeat subsets isolated from single cfu, step (ii) assembly according to the second protocol is performed with a pool of capture vectors with assembled TAL repeat subsets resulting from step (i) assembly without prior screening and clone selection.

Thus, the invention further relates to a second two-step method for assembling a functional vector comprising a TAL effector composed of n TAL cassettes wherein the method starts with a library of TAL cassette trimers comprising all possible combinations of three cassettes wherein each cassette is capable of specifically binding one nucleotide, said method being characterized by at least the following steps: (i) in a first step performing a first reaction wherein cassettes 1 to n/2 are concurrently cloned in a first capture vector using n/6 trimers, performing at least a second reaction wherein cassettes (n/2+1) to n are concurrently cloned in a second capture vector using n/6 trimers, wherein in the first capture vector cassettes 1 to n/2 are flanked by a first and a second type IIS cleavage site and in the second capture vector cassettes (n/2+1) to n are flanked by a second and a third cleavage site, and wherein the first, second, and third cleavage sites provide different overhangs when cleaved with one or more restriction enzymes and (ii) performing a third reaction wherein at least cassettes 1 to n/2 and cassettes (n/2+1) to n are released from the at least first and second capture vector in the presence of one or more, preferably the same type IIS restriction enzyme and are cloned in directed order via compatible ends of the first, second and third cleavage sites into a functional vector that provides overhangs compatible with the first and the third cleavage site, and wherein the third reaction is performed using a pool of isolated first capture vectors at least a portion thereof carrying assembled TAL repeat subsets obtained from the first reaction and a pool of isolated second capture vectors at least a portion thereof carrying assembled TAL repeat subsets resulting from the second reaction. In one embodiment according to the second protocol the functional vector may be provided in a linearized form. 
In another embodiment the functional vector may be provided in a closed circular form and may be cleaved together with the at least first and second capture vector in the same reaction. In another embodiment the at least first and second capture vector and the functional vector are cleaved by the same type IIS restriction enzyme. A detailed description of an embodiment according to such second protocol is provided in Example 3b.

The two-step method according to the second protocol may further be characterized in that the at least two reactions of step (i) are performed in parallel. Furthermore, in some embodiments, no PCR step is involved in either of steps (i) or (ii). The assembly reaction in step (i) and/or step (ii) may be performed in the presence of a ligase such as, e.g., a T4 or Taq ligase. In some instances, at least one overhang in the reactions in step (i) and/or step (ii) may be generated by one of the following restriction enzymes: BbsI, BsmBI, BsaI, AarI, BtgZI, or SapI. In many instances at least one of the first capture vector, the second capture vector and/or the functional vector contains a counter selectable marker gene. In one embodiment the counter selectable marker gene may be a toxin gene such as, e.g., ccdB or tse2 as described elsewhere herein.

In efforts to further optimize and speed up the assembly process the inventors have developed a third protocol according to which the production of TAL effector fusions can be achieved within three days. A lab workflow for the two-step assembly of a TAL effector fusion according to such third protocol may therefore be characterized by the following steps: Day 1: step (i) assembly of TAL repeat subsets in capture vectors followed by step (ii) assembly and subsequent transformation of the reactions into competent bacteria (as outlined in the first protocol), and plating on selective media; Day 2: cPCR to identify clones carrying assembled TAL effector fusions of correct length followed by inoculation of selective media culture(s) with selected cfu; Day 3: plasmid preparation from (typically overnight) culture(s) and sequencing of TAL effector fusions or parts thereof. The workflow according to such third protocol is summarized in TABLE 13. The findings that μg-amounts of correctly assembled TAL effector fusions can be obtained within three days using the two-step assembly workflow according to the third protocol indicate that the reaction products obtained from step (i) assembly (i.e., capture vectors containing TAL repeat subsets) can be directly used in step (ii) assembly without prior amplification (i.e., transformation and growth in selective media culture) and isolation (i.e., plasmid preparation) of assembled capture vectors. Thus, whereas according to the second protocol step (ii) assembly is performed with isolated pools of capture vectors carrying TAL repeat subsets resulting from step (i) assembly, step (ii) assembly according to the third protocol is performed using the reaction mixture from step (i) assembly or portions thereof (containing capture vectors with assembled TAL repeat subsets) without prior amplification and isolation of assembled capture vectors.

Thus, the invention further relates to a third two-step method for assembling a functional vector comprising a TAL effector composed of n TAL cassettes wherein the method starts with a library of TAL cassette trimers comprising all possible combinations of three cassettes wherein each cassette is capable of specifically binding one nucleotide, said method being characterized by the following steps: (i) in a first step performing a first reaction wherein cassettes 1 to n/2 are concurrently cloned in a first capture vector using n/6 trimers, performing at least a second reaction wherein cassettes (n/2+1) to n are concurrently cloned in a second capture vector using n/6 trimers, wherein in the first capture vector cassettes 1 to n/2 are flanked by a first and a second type IIS cleavage site and in the second capture vector cassettes (n/2+1) to n are flanked by a second and a third cleavage site, and wherein the first, second, and third cleavage sites provide different overhangs when cleaved with one or more restriction enzymes and (ii) performing a third reaction wherein at least cassettes 1 to n/2 and cassettes (n/2+1) to n are released from the at least first and second capture vector in the presence of one or more, preferably the same type IIS restriction enzyme and are cloned in directed order via compatible ends of the first, second and third cleavage sites into a functional vector that provides overhangs compatible with the first and the third cleavage site and wherein the third reaction is performed using the reaction mixture from the first reaction or a portion thereof containing first capture vectors with assembled TAL repeat subsets and the reaction mixture from the second reaction or a portion thereof containing second capture vectors with assembled TAL repeat subsets. In one embodiment according to the third protocol, the functional vector may be provided in a linearized form. 
In another embodiment, the functional vector may be provided in a closed circular form and may be cleaved together with the at least first and second capture vector in the same reaction. In an additional embodiment, the at least first and second capture vector and the functional vector are cleaved by the same type IIS restriction enzyme. A detailed description of an embodiment according to such third protocol is provided in Example 3c.

The two-step method according to the third protocol may further be characterized in that the at least two reactions of step (i) are performed in parallel. Furthermore, in some embodiments, no PCR step is involved in either of steps (i) or (ii). The assembly reaction in step (i) and/or step (ii) may be performed in the presence of a ligase such as, e.g., a T4 or Taq ligase. In some instances, at least one overhang in the reactions in step (i) and/or step (ii) may be generated by one of the following restriction enzymes: BbsI, BsmBI, BsaI, AarI, BtgZI, or SapI. In many instances at least one of the first capture vector, the second capture vector and/or the functional vector contains a counter selectable marker gene. In one embodiment the counter selectable marker gene may be a toxin gene such as, e.g., ccdB or tse2 as described elsewhere herein.

TABLE 13

Days    First Protocol              Second Protocol             Third Protocol
Day 1   step (i) assembly           step (i) assembly           step (i) assembly
        transformation              transformation              step (ii) assembly
                                    inoculate culture           transformation
Day 2   cPCR                        prepare plasmid DNA         cPCR
        inoculate culture           step (ii) assembly          inoculate culture
                                    transformation
Day 3   prepare plasmid DNA         cPCR                        prepare plasmid DNA
        sequence subparts           inoculate culture           sequence final construct
        select correct clones                                   select correct clone
Day 4   step (ii) assembly          prepare plasmid DNA         -
        transformation              sequence final construct
                                    select correct clone
Day 5   cPCR                        -                           -
        inoculate culture
Day 6   prepare plasmid DNA         -                           -
        sequence final construct
        select correct clone
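The three workflows of TABLE 13 can also be written down as a small data structure, which makes the differences in duration and in the number of sequencing rounds easy to compare. The following is an illustrative sketch only; the dictionary layout and helper names are hypothetical, while the step labels are taken from the table.

```python
# One list per protocol; each inner list holds the steps of one working day.
SCHEDULES = {
    "first": [
        ["step (i) assembly", "transformation"],
        ["cPCR", "inoculate culture"],
        ["prepare plasmid DNA", "sequence subparts", "select correct clones"],
        ["step (ii) assembly", "transformation"],
        ["cPCR", "inoculate culture"],
        ["prepare plasmid DNA", "sequence final construct", "select correct clone"],
    ],
    "second": [
        ["step (i) assembly", "transformation", "inoculate culture"],
        ["prepare plasmid DNA", "step (ii) assembly", "transformation"],
        ["cPCR", "inoculate culture"],
        ["prepare plasmid DNA", "sequence final construct", "select correct clone"],
    ],
    "third": [
        ["step (i) assembly", "step (ii) assembly", "transformation"],
        ["cPCR", "inoculate culture"],
        ["prepare plasmid DNA", "sequence final construct", "select correct clone"],
    ],
}


def days_required(protocol):
    """Number of working days of a protocol."""
    return len(SCHEDULES[protocol])


def day_of(protocol, step):
    """First day (1-based) on which `step` is performed."""
    for day, steps in enumerate(SCHEDULES[protocol], start=1):
        if step in steps:
            return day
    raise ValueError(f"{step!r} is not part of the {protocol} protocol")


def sequencing_rounds(protocol):
    """Count the sequencing steps across all days of a protocol."""
    return sum(s.startswith("sequence") for steps in SCHEDULES[protocol] for s in steps)
```

Under this encoding, step (ii) assembly falls on Day 4, Day 2 and Day 1 for the first, second and third protocol, respectively, and only the first protocol includes two sequencing rounds.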

Whereas the first protocol has been shown to be most efficient (both for 18-mer and 24-mer assembly) in terms of the number of correct cfu per experiment, the third protocol allows for significantly reduced production times but has a lower cloning efficiency resulting in a lower number of correct cfu per experiment, which may be compensated by the number of screened colonies. The second protocol combines the positive features of both: a high efficiency (i.e., number of correct cfu screened) and a shorter production time. The protocol chosen for assembly will generally depend on the underlying conditions and project requirements. For example, in cases where capture vectors carrying TAL repeat subsets are to be used separately or recycled in other assembly reactions, a method according to the first protocol may be most appropriate. A method according to the first or second protocol may also be preferred where sequencing capacities are limited and production time is less critical. In other cases where short production/delivery times are of paramount importance or where assembly steps are performed on automated or high throughput platforms, the second or in particular third protocol may be most appropriate. In certain instances, various protocols may be combined or adapted or performed in parallel to achieve an optimal combination of efficiency and speed. For example, the first protocol may be used as backup for the second protocol and the second protocol may be used as backup for the third protocol in cases where colonies screened following step (ii) assembly do not contain correct sequences.

Apart from the assembly of TAL effector fusions, the protocols according to the invention can likewise be used for the step-wise assembly of any other DNA molecule from multiple subfragments. Currently, such subfragments are mostly assembled according to first protocol embodiments via consecutive assembly reactions, each followed by a selection step. Using assembly strategies according to the second or third protocol, or possibly even combining multiple step (i) and step (ii) assembly reactions, all but the final selection steps could be dropped, which can significantly reduce the time required.

Thus, the invention further relates, in part, to a two-step method for assembling a DNA molecule from multiple (n) DNA subfragments wherein the method starts either from n vectors each carrying a single subfragment or from a library of subfragments wherein each vector in said library carries m subfragments with m<n, said method being characterized by the following steps: (i) in a first step performing a first reaction wherein a first amount of subfragments are concurrently cloned in a first cloning or capture vector, performing at least a second reaction wherein at least a second amount of subfragments are concurrently cloned in a second cloning or capture vector, wherein the first and last subfragment in the first cloning or capture vector are flanked by a first and a second type IIS cleavage site and the first and last subfragment in the second cloning or capture vector are flanked by a second and a third cleavage site, and wherein the first, second, and third cleavage sites provide different overhangs when cleaved with one or more restriction enzymes and (ii) performing a third reaction wherein the at least first and second amounts of subfragments are released from the at least first and second cloning or capture vector in the presence of one or more, preferably the same type IIS restriction enzyme and are cloned in directed order via compatible ends of the first, second and third cleavage sites into a target vector (e.g., an expression vector) that provides overhangs compatible with the first and the third cleavage site. 
In one embodiment, the third reaction is performed using a pool of isolated first cloning or capture vectors obtained from the first reaction and a pool of isolated second cloning or capture vectors resulting from the second reaction wherein at least a portion of the pool of isolated first cloning or capture vectors contains a correctly assembled first amount of subfragments and at least a portion of the pool of isolated second cloning or capture vectors carries a correctly assembled second amount of subfragments. In another embodiment, the third reaction is performed using the reaction mixture from the first reaction or a portion thereof containing first capture vectors with a correctly assembled first amount of subfragments and the reaction mixture from the second reaction or a portion thereof containing second capture vectors with a correctly assembled second amount of subfragments. In one embodiment, the target vector may be provided in a linearized form. In another embodiment, the target vector may be provided in a closed circular form and may be cleaved together with the at least first and second cloning or capture vector in the same reaction. In an additional embodiment, the at least first and second cloning or capture vector and the target vector are cleaved by the same type IIS restriction enzyme. The efficient assembly method of the invention allows for two or more subfragments to be cloned concurrently into each cloning or capture vector or target vector. In certain instances subfragment subsets derived from two, three, four, five or six cloning or capture vectors may be assembled concurrently into the same target vector in one step (ii) assembly reaction. Likewise, two, three, four, five or six subfragments may be cloned in each cloning or capture vector in a step (i) assembly reaction and the amount of subfragments cloned in each cloning or capture vector may be equal or different.
For example, a first cloning or capture vector may carry three subfragments, whereas a second and a fourth cloning or capture vector may carry four subfragments each as described below (see, e.g., TABLEs 14 or 15). In certain embodiments, the at least two reactions of step (i) assembly are performed in parallel. Furthermore, in some embodiments, no PCR step is involved in either of steps (i) or (ii). The assembly reaction in step (i) and/or step (ii) may be performed in the presence of a ligase such as, e.g., a T4 or Taq ligase. In some instances, at least one overhang in the reactions in step (i) and/or step (ii) may be generated by one of the following restriction enzymes: BbsI, BsmBI, BsaI, AarI, BtgZI, or SapI. In many instances, at least one of the first or second cloning or capture vectors and/or the target vector may contain a counter selectable marker gene. In one embodiment, the counter selectable marker gene may be a toxin gene such as, e.g., ccdB or tse2 as described elsewhere herein.
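By way of illustration only, the directed-order cloning of step (ii) can be sketched as a chaining of compatible overhangs. In the following minimal sketch, each released block is described by a left and a right overhang, and compatibility is simplified to equality of the top-strand overhang sequence. The function name, the block names and the 4-nucleotide overhang sequences are hypothetical and are not taken from the specification.

```python
def order_by_overhangs(fragments, vector_left, vector_right):
    """Chain fragments so that each right overhang matches the next left overhang.

    fragments: iterable of (name, left_overhang, right_overhang) tuples.
    vector_left / vector_right: overhangs offered by the opened target vector.
    """
    by_left = {}
    for name, left, right in fragments:
        if left in by_left:
            raise ValueError(f"overhang {left} is not unique")
        by_left[left] = (name, right)

    order, current = [], vector_left
    while current != vector_right:
        if current not in by_left:
            raise ValueError(f"no fragment starts with overhang {current}")
        name, current = by_left.pop(current)  # consume the fragment
        order.append(name)
    if by_left:
        raise ValueError("unused fragments remain")
    return order


# Hypothetical overhangs: site 1 = AAAC, site 2 = CTGA, site 3 = GGTT.
inserts = [
    ("capture2_block", "CTGA", "GGTT"),  # flanked by sites 2 and 3
    ("capture1_block", "AAAC", "CTGA"),  # flanked by sites 1 and 2
]
print(order_by_overhangs(inserts, "AAAC", "GGTT"))
```

Because the three cleavage sites yield different overhangs, only one order of blocks closes the target vector, which is the property the sketch checks.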

The functional vector may be designed to contain at least the TAL N-terminal and C-terminal domains or truncated versions thereof and a counter selectable marker gene. In addition, the functional vector may contain at least one effector fusion such as, e.g., a fusion with activator, repressor, nuclease, acetylase, deacetylase, methylase, demethylase activity (see, e.g., FIG. 7C) or an effector fusion insertion site or multiple cloning site (MCS) as shown in FIG. 8A. In one embodiment the functional vector is a GATEWAY® entry clone comprising att recombination sites (see, e.g., FIGS. 8A-8C). In the functional vector, the TAL effector sequence may be cloned 5′ or 3′ to the effector fusion sequence or effector insertion sequence. In certain instances, the functional vector may have a sequence selected from the group of SEQ ID NOs: 30, 31, 32, 33, 34, 35, or 36.

In certain instances, the two-step assembly method may be used to assemble TAL effector fusions wherein the functional vector encodes a FokI nuclease cleavage domain or a truncated FokI nuclease cleavage domain. In some embodiments, the FokI nuclease cleavage domain may carry at least one of the following mutations: E490K, I538K, H537R, Q486E, I499L, N496D, R487D, N496D, D483R, H537R.

Furthermore, the functional vector used in the two-step method may carry at least one sequence that is codon-optimized with regard to a target host including but not limited to the TAL cassettes, TAL repeat, the TAL N-terminal and/or C-terminal coding sequences, the TAL effector or the TAL effector fusion sequence.

In yet another embodiment the above two-step method for assembling a functional vector may rely on tetramers or may combine trimers and tetramers with dimers etc. TABLE 14 shows examples of how different library building blocks may be combined to assemble TAL effectors with n repeats.

TABLE 14
n (no. of repeats)   first capture vector                      second capture vector
24                   4 trimers (12 cassettes)                  4 trimers (12 cassettes)
24                   3 tetramers (12 cassettes)                3 tetramers (12 cassettes)
21                   4 trimers (12 cassettes)                  3 trimers (9 cassettes)
20                   2 tetramers and 1 dimer (10 cassettes)    2 tetramers and 1 dimer (10 cassettes)
18                   3 trimers (9 cassettes)                   3 trimers (9 cassettes)
18                   3 tetramers (12 cassettes)                2 trimers (6 cassettes)
16                   2 tetramers (8 cassettes)                 2 tetramers (8 cassettes)
12                   4 trimers                                 -
12                   3 trimers (9 cassettes)                   1 trimer (3 cassettes)
12                   2 tetramers (8 cassettes)
30                   4 tetramers (16 cassettes)                2 tetramers and 2 trimers (14 cassettes)
28                   2 tetramers and 2 trimers (14 cassettes)  2 tetramers and 2 trimers (14 cassettes)

The skilled artisan will understand that in cases where a given number n of repeats cannot be assembled in 2 capture vectors, a third or fourth capture vector may be used as indicated in the following example in TABLE 15:

TABLE 15
n (no. of repeats)   first capture vector         second capture vector        third capture vector
35                   3 tetramers (12 cassettes)   3 tetramers (12 cassettes)   2 tetramers and 1 trimer (11 cassettes)
17                   3 trimers (9 cassettes)      2 trimers (6 cassettes)      1 dimer (2 cassettes)
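The distribution of building blocks over capture vectors shown in TABLEs 14 and 15 can be illustrated with a short sketch. The following is one possible planning heuristic and is not taken from the specification: it prefers tetramers, falls back to trimers and a single dimer, and assumes at most four building blocks per capture vector (the largest number appearing in the tables). Function names are hypothetical, and the plan it returns is only one of several valid combinations for a given n.

```python
import math


def split_into_blocks(n):
    """Express n cassettes as tetramers (4), trimers (3) and, if needed, one dimer (2)."""
    for dimers in (0, 1):
        rest = n - 2 * dimers
        for tetramers in range(rest // 4, -1, -1):
            remainder = rest - 4 * tetramers
            if remainder % 3 == 0:
                return [4] * tetramers + [3] * (remainder // 3) + [2] * dimers
    raise ValueError(f"cannot build {n} cassettes from dimers, trimers and tetramers")


def plan_capture_vectors(n, max_blocks=4):
    """Spread the blocks as evenly as possible over the fewest capture vectors."""
    blocks = split_into_blocks(n)
    vectors = math.ceil(len(blocks) / max_blocks)
    return [blocks[i::vectors] for i in range(vectors)]


for n in (24, 35):
    plan = plan_capture_vectors(n)
    print(n, plan, [sum(v) for v in plan])
```

For n = 35 this heuristic yields three capture vectors carrying 12, 12 and 11 cassettes, in line with the first row of TABLE 15; for other values of n it may choose a different but equally valid combination than the one tabulated.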

Solid phase TAL assembly. The invention further relates to a method for assembling TAL effector molecules on a solid phase. The method allows for assembly of multiple different TAL effector molecules in a parallel, high-throughput and template-independent manner by using predesigned double stranded nucleic acid building blocks as illustrated in FIG. 9. TAL effector assembly on solid phase can be performed using single cassettes or may be performed using the trimer or tetramer libraries described above. The building blocks may comprise one cassette encoding a single TAL repeat or they may comprise two or more cassettes encoding two or more TAL repeats. In some embodiments, a library of different modules is presented that contains at least three categories of modules: starter modules, elongation modules and completion modules. As illustrated in FIG. 9A, starter modules may be designed to comprise a TAL effector 5′ flanking region attached to an anchor that can be immobilized on a solid phase and at least one TAL cassette. The anchor may, e.g., be a biotin anchor and may be bound to a streptavidin-coated surface. In one embodiment the at least one cassette is a T-binding cassette. In other embodiments the at least one cassette may be an A-, G- or C-binding cassette. In some instances starter modules may comprise more than one TAL cassette. For example, the starter modules may comprise a trimer or tetramer of cassettes as described above. In such case the library of starter modules may comprise all possible combinations of trimers or tetramers of A-, G-, C- and T-binding cassettes etc. In FIG. 9A, sixteen different starter modules are provided comprising all possible trimer combinations starting with a T-binding cassette. Elongation modules are used to sequentially elongate the nucleic acid chain immobilized on the solid phase. Like the starter modules, the elongation modules may consist of one or more cassettes. In some embodiments, the elongation modules comprise more than one cassette such as, e.g., trimers or tetramers. A library of trimers would consist of 64 elongation modules whereas a library of tetramers would consist of 256 elongation modules to represent all possible combinations of A-, G-, C- and T-binding cassettes. The last cassette may be provided by a completion module that may further carry TAL effector 3′ flanking regions. The completion module may likewise comprise one or more TAL effector cassettes.
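The library sizes stated above follow from counting all ordered combinations of four cassette types. A minimal sketch of that count is given below; the function name is hypothetical, and each module is represented simply by the string of bases its cassettes bind.

```python
from itertools import product

BASES = "ACGT"  # one letter per A-, C-, G- or T-binding cassette


def module_library(k, first=None):
    """All modules of k cassettes, optionally restricted to a fixed first cassette."""
    combos = ("".join(p) for p in product(BASES, repeat=k))
    return [c for c in combos if first is None or c[0] == first]


trimers = module_library(3)
tetramers = module_library(4)
t_starters = module_library(3, first="T")
print(len(trimers), len(tetramers), len(t_starters))  # 64 256 16
```

The 16 trimers beginning with a T-binding cassette correspond to the sixteen starter modules referred to for FIG. 9A.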

Thus the invention relates to a library of TAL assembly modules that contains three different categories of at least partially double stranded DNA building blocks:

(i) starter modules comprising a modification by which they can be immobilized on a solid phase, 5′ TAL flanking sequences and one or more TAL cassettes wherein the starter module contains at least one first type IIS cleavage site flanking the 5′ TAL sequences and at least one second type IIS cleavage site flanking the 3′ end of the TAL cassettes,
(ii) elongation modules comprising one or more TAL cassettes wherein the 5′ and 3′ ends of the TAL cassette sequence are flanked by a third and a fourth type IIS cleavage site, and
(iii) completion modules comprising at least one or more TAL cassettes and 3′ TAL flanking sequences wherein the completion module contains at least a fifth type IIS cleavage site flanking the 5′ end of the TAL cassettes and a sixth type IIS cleavage site flanking the 3′ ends of the 3′ TAL flanking sequences.

In some instances the solid phase assembly of TAL effector molecules may start with the immobilization of a starter module on a solid support followed by repeated cycles of type IIS-mediated cleavage and ligation of selected elongation modules as described in FIGS. 9B-9G, thereby sequentially elongating the TAL effector sequence. In the last cycle a completion module may be added to provide the 3′ TAL flanking sequences. After completion of the modular assembly, the full-length TAL effector sequence comprising a defined number of TAL cassettes may be released from the solid support and cloned into a functional vector via terminal type IIS cleavage sites.

Thus, the invention further relates to a method for the manufacture of a nucleic acid molecule encoding a TAL effector or TAL effector fusion comprising the steps of:

a) providing a double-stranded starter module which has a modification by which it is immobilized on a surface, wherein the starter module comprises at least 5′ TAL flanking sequences fused to one or more TAL cassettes and at least a first recognition site for a first type IIS enzyme to generate a single-stranded overhang at the 5′ end of the starter module, and a second recognition site for a second type IIS enzyme to generate a single-stranded overhang at the 3′ end of the starter module, and which starter module is provided with a single-stranded overhang at the 3′ end following cleavage with the second type IIS enzyme,
b) providing a first double-stranded elongation module wherein the elongation module comprises one or more TAL cassettes and at least a third recognition site for a third type IIS enzyme to generate a single-stranded overhang at the 5′ end of the elongation module and a fourth recognition site for a fourth type IIS enzyme to generate a single-stranded overhang at the 3′ end of the elongation module, and which elongation module comprises a single-stranded overhang at the 5′ end following cleavage with the third type IIS enzyme, wherein said single-stranded overhang at the 5′ end is complementary to the single-stranded overhang at the 3′ end of the starter module,
c) ligating the starter module and the first elongation module via their overhangs generating a first ligation product,
d) cutting the ligation product with the fourth type IIS restriction enzyme to generate a single-stranded overhang at the 3′ end of the first elongation module,
e) providing a second double-stranded elongation module that comprises a single-stranded overhang at the 5′ end following cleavage with the third type IIS enzyme, wherein said single-stranded overhang at the 5′ end is complementary to the single-stranded overhang at the 3′ end of the first elongation module,
f) ligating the second elongation module and the first ligation product via their overhangs generating a second ligation product,
g) optionally, repeating steps d) to f) until a desired number of further elongation cassettes has been added,
h) providing a double-stranded completion module comprising at least 3′ TAL flanking sequences fused to one or more TAL cassettes and at least a fifth recognition site for a fifth type IIS enzyme to generate a single-stranded overhang at the 5′ end of the completion module and a sixth recognition site for a sixth type IIS enzyme to generate a single-stranded overhang at the 3′ end of the completion module, and which completion module is provided with a single-stranded overhang at the 5′ end following cleavage with the fifth type IIS enzyme,
i) ligating the completion module and the final elongation module of the immobilized ligation product via their overhangs generating a final ligation product, and
j) releasing the final ligation product via cleavage with the first and sixth type IIS enzymes.

In one embodiment, the first type IIS enzyme may be the same as the sixth type IIS enzyme and the second type IIS enzyme may be the same as the fourth type IIS enzyme. In most instances the at least first cassette of the starter module in step a) may be a T-binding cassette. In some embodiments, the starter modules, the elongation modules and the completion modules comprise three or four TAL cassettes.
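Steps a) to j) above can be modelled, purely for illustration, as a selection of library modules followed by successive ligation cycles. The sketch below is a bookkeeping model only: it tracks which bases the assembled cassettes bind and ignores the enzymology (recognition sites, overhang sequences, flanking regions). The function names and the example target sequence are hypothetical.

```python
def choose_modules(target, block=3):
    """Split a target base sequence into starter, elongation and completion blocks."""
    if len(target) < 2 * block or len(target) % block:
        raise ValueError("target length must be a multiple of the block size "
                         "and span at least two blocks")
    blocks = [target[i:i + block] for i in range(0, len(target), block)]
    return {"starter": blocks[0], "elongation": blocks[1:-1], "completion": blocks[-1]}


def assemble(plan):
    """Replay the cycle: immobilized starter, elongation rounds, completion, release."""
    chain = plan["starter"]            # step a): starter module on the solid phase
    cycles = 0
    for module in plan["elongation"]:  # steps b) to g): one cleave-and-ligate round each
        chain += module
        cycles += 1
    chain += plan["completion"]        # steps h) and i): completion module
    return chain, cycles               # step j): released product


plan = choose_modules("TACGGATCCAGT")  # hypothetical 12-base target starting with T
print(plan)
print(assemble(plan))
```

With trimer modules, the 12-repeat example needs one starter, two elongation cycles and one completion module.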

Vectors for Assembly of TAL Effector Constructs

Nucleic acids which encode TAL effectors and TAL effector fusions may be constructed, propagated, and used to generate TAL proteins by a considerable number of methods, including Type IIS restriction enzyme assembly systems, as described elsewhere herein.

In many instances, nucleic acids which encode TAL effectors and TAL effector fusions may either be integrated in cellular nucleic acid (e.g., a chromosome, etc.) or contained within a vector (e.g., a plasmid, a lentiviral vector, etc.).

Nucleic acid molecules encoding TAL proteins may have any number of components. As an example, TAL effector fusions will typically contain the following regions: (1) a region with two or more TAL repeats, (2) polypeptide regions flanking the TAL repeat region, and (3) a fusion partner. Some examples of additional regions which may be present include: (1) a linker region (e.g., a linker which connects the fusion partner to the TAL effector) and (2) a tag region (e.g., an affinity purification tag). Examples of nucleic acids which encode TAL fusion proteins are shown in the lower portion of FIG. 7B.

Vectors which contain TAL coding sequences can be generated by any number of methods. In some instances, TAL cassette nucleic acid may be chemically synthesized, then either individually connected to or inserted into other nucleic acid molecules (e.g., a vector) or connected to other TAL cassettes then connected to or inserted into other nucleic acid molecules (e.g., a vector). Methods for the construction of nucleic acid segments encoding TAL repeats are described elsewhere herein.

A series of closed, circular nucleic acid molecules into which TAL cassettes and TAL repeats may be inserted are shown in FIG. 7C. This figure shows vectors which may be used to generate (1) a TAL effector-FokI fusion nuclease pair, (2) a TAL effector-VP16 fusion, (3) a TAL effector-KRAB domain fusion, and (4) a TAL effector-effector protein (e.g., acetylase, deacetylase, methylase, demethylase, kinase, phosphatase, etc.) fusion. In each instance, the starting nucleic acid molecule is digested with a restriction enzyme that cuts at a site which differs from the recognition site (e.g., a Type II restriction enzyme such as a Type IIS enzyme). The result is excision of the ccdB (or alternatively tse2) coding sequence from the vector and the formation of a linear vector. This vector may have any of the following: (1) two blunt termini, (2) two termini with overhanging ends (e.g., two 5′ overhangs, two 3′ overhangs, or a 5′ and a 3′ overhang), or (3) one blunt terminus and one overhanging terminus (a 5′ or a 3′ overhang).

Termini may be linked by any number of methods. Ligases (e.g., T4 DNA ligase) and topoisomerases are examples of enzymes which may be used to covalently connect one or both strands of different termini to each other. Ligases may be used, for example, to covalently connect both strands of both termini of a vector with both strands of both termini of another nucleic acid molecule (e.g., an insert) to generate an un-nicked, closed, circular nucleic acid molecule. As a specific example (see FIG. 10A), one terminus of the vector and one terminus of the other nucleic acid molecule may have complementary overhangs and the two other termini of both molecules may be blunt. In such a case, the complementarity of the overhanging termini may be used to direct the orientation by which the two nucleic acid molecules are connected to each other (e.g., so that the insert molecules go into the vector molecules in the same orientation).

Topoisomerases are categorized as type I, including type IA and type IB topoisomerases, which cleave a single strand of a double stranded nucleic acid molecule, and type II topoisomerases (gyrase), which cleave both strands of a nucleic acid molecule. Type IA and IB topoisomerases cleave one strand of a double-stranded nucleotide molecule. Cleavage of a double-stranded nucleotide molecule by type IA topoisomerases generates a 5′ phosphate and a 3′ hydroxyl at the cleavage site, with the type IA topoisomerase covalently binding to the 5′ terminus of a cleaved strand. In comparison, cleavage of a double-stranded nucleotide molecule by type IB topoisomerases generates a 3′ phosphate and a 5′ hydroxyl at the cleavage site, with the type IB topoisomerase covalently binding to the 3′ terminus of a cleaved strand. Type I and type II topoisomerases, as well as catalytic domains and mutant forms thereof, are useful for generating double-stranded recombinant nucleic acid molecules.

Type IA topoisomerases include E. coli topoisomerase I, E. coli topoisomerase III, eukaryotic topoisomerase II, archaeal reverse gyrase, yeast topoisomerase III, Drosophila topoisomerase III, human topoisomerase III, Streptococcus pneumoniae topoisomerase III, and the like, including other type IA topoisomerases. E. coli topoisomerase III, which is a type IA topoisomerase that recognizes, binds to and cleaves the sequence 5′-GCAACTT-3′, can be particularly useful in methods of the invention.

Type IB topoisomerases include the nuclear type I topoisomerases present in all eukaryotic cells and those encoded by Vaccinia and other cellular poxviruses. The eukaryotic type IB topoisomerases are exemplified by those expressed in yeast, Drosophila and mammalian cells, including human cells. Viral type IB topoisomerases are exemplified by those produced by the vertebrate poxviruses (Vaccinia, Shope fibroma virus, ORF virus, fowlpox virus, and Molluscum contagiosum virus), and the insect poxvirus (Amsacta moorei entomopoxvirus).

Type II topoisomerases include, for example, bacterial gyrase, bacterial DNA topoisomerase IV, eukaryotic DNA topoisomerase II, and T-even phage encoded DNA topoisomerases. Like the type IB topoisomerases, the type II topoisomerases have both cleaving and ligating activities. In addition, like type IB topoisomerase, substrate double-stranded nucleic acid molecules can be prepared such that the type II topoisomerase can form a covalent linkage to one strand at a cleavage site. For example, calf thymus type II topoisomerase can cleave a substrate double-stranded nucleic acid molecule containing a 5′ recessed topoisomerase recognition site positioned three nucleotides from the 5′ end, resulting in dissociation of the three nucleotide sequence 5′ to the cleavage site and covalent binding of the topoisomerase to the 5′ terminus of the double-stranded nucleic acid molecule. Furthermore, upon contacting such a type II topoisomerase-charged double-stranded nucleic acid molecule with a second nucleotide sequence containing a 3′ hydroxyl group, the type II topoisomerase can ligate the sequences together, and then is released from the recombinant nucleic acid molecule. As such, type II topoisomerases may be incorporated into compositions of the invention and also are useful for performing methods of the invention.

The invention includes methods for generating double-stranded nucleic acid molecules with topoisomerase covalently linked to at least one terminus. As an example, a double-stranded nucleic acid molecule with the following sequence at a terminus:

CCCTTATT-3′ Terminus
GGGAATAA-5′ Terminus

may be contacted with a Vaccinia topoisomerase (a Type IB topoisomerase) under conditions suitable to generate the following terminus:

CCCTT-3′ Terminus
GGGAATAA-5′ Terminus

with topoisomerase covalently bound to the 3′ phosphate. After nicking of the double-stranded nucleic acid molecule, the ATT segment will no longer be covalently bound and will tend to dissociate from the double-stranded nucleic acid molecule, leaving an overhanging sequence of 3′-TAA-5′.
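The terminus example above can be traced with a few lines of string handling. The sketch below is illustrative only: it assumes that nicking occurs immediately 3′ of the CCCTT motif on the top strand, as in the example, and the function name and the strand representation (bottom strand written 3′ to 5′ so that paired bases share an index) are hypothetical conventions.

```python
SITE = "CCCTT"  # motif after which the top strand is nicked in the example above


def topo_cleave(top_5to3, bottom_3to5):
    """Nick the top strand 3' of SITE and report what remains.

    top_5to3:    top strand written 5'->3'
    bottom_3to5: paired bottom strand written 3'->5' (same column order)
    Returns (charged top strand, released segment, exposed bottom overhang 3'->5').
    """
    start = top_5to3.find(SITE)
    if start < 0:
        raise ValueError("no recognition motif on the top strand")
    cut = start + len(SITE)
    charged = top_5to3[:cut]       # topoisomerase stays bound to this 3' phosphate
    released = top_5to3[cut:]      # short segment that dissociates
    overhang = bottom_3to5[cut:]   # unpaired bases left on the bottom strand
    return charged, released, overhang


print(topo_cleave("CCCTTATT", "GGGAATAA"))  # ('CCCTT', 'ATT', 'TAA')
```

The returned overhang, read 3′ to 5′, is the TAA sequence described in the text.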

The invention thus includes (1) nucleic acid molecules which contain one or more (e.g., one, two, three, four, five, six, from about one to about two, from about one to about five, etc.) topoisomerase recognition sites, (2) nucleic acid molecules which contain one or more bound (e.g., covalently bound) topoisomerases, (3) methods for producing nucleic acid molecules of (1) and (2), and (4) methods for connecting nucleic acid molecules of (1) and (2) to other nucleic acid molecules.

FIG. 10A shows one embodiment where topoisomerase is covalently bound to the 3′ phosphate at both ends of the vector. When a compatible nucleic acid terminus (a terminus with a strand having a 5′ hydroxyl) comes into contact with the topoisomerase-adapted terminus, strands of each terminus are covalently connected to each other, resulting in a nick at or near the junction point. This nick may be repaired by any number of means but will normally be automatically repaired upon introduction into a cell by DNA repair mechanisms.

Two types of inserts are shown in FIG. 10A. Insert 1 is designed to hybridize with the vector by sequence complementarity of overhanging ends (with Y being bases that will pair with X bases). Insert 2 also hybridizes to the vector by sequence complementarity, through a strand invasion mechanism whereby the single stranded 3′ XXXX sequence of the vector hybridizes to the 5′ YYYY sequence of Insert 2, resulting in a “flap” of Xs hanging off of Insert 2. Further, the 5′ terminal Y of Insert 2 becomes covalently bound to the 3′ T of the vector. This covalent bond stabilizes the association of the vector with Insert 2. As with nicks, the flap may be automatically removed by introduction of the assembly into a cell with functional DNA repair mechanisms, resulting in a junction where both strands are covalently bound to each other and complete hybridization of regional nucleotides (e.g., no mismatched bases).

Nucleic acid molecules of the invention and used in the practice of the invention may also contain recombination sites, also referred to as recombinational cloning sites. Recombination sites suitable for use in the invention may be any nucleic acid that can serve as a substrate in a recombination reaction. Such recombination sites may be wild-type or naturally occurring recombination sites, or modified, variant, derivative, or mutant recombination sites. Examples of recombination sites for use in the invention include, but are not limited to, lambda phage recombination sites (such as attP, attB, attL, and attR and mutants or derivatives thereof) and recombination sites from other bacteriophage such as phi80, P22, P2, 186, P4 and P1 (including lox sites such as loxP and loxP511). Mutated att sites (e.g., attB, attP, attR and attL sites) are described in U.S. Patent Publication No. 2011/0275541, which is incorporated herein by reference. Other recombination sites having unique specificity (i.e., a first site will recombine with its corresponding site and will not recombine with a second site having a different specificity) are known to those skilled in the art and may be used to practice the present invention. Corresponding recombination proteins for these systems may be used in accordance with the invention with the indicated recombination sites. Other systems providing recombination sites and recombination proteins for use in the invention include the FLP/FRT system from Saccharomyces cerevisiae, the resolvase family (e.g., γδ, TndX, TnpX, Tn3 resolvase, Hin, Hjc, Gin, SpCCE1, ParA, and Cin), and IS231 and other Bacillus thuringiensis transposable elements. Other suitable recombination systems for use in the present invention include the XerC and XerD recombinases and the psi, dif and cer recombination sites in E. coli. Suitable recombination proteins and mutant, modified, variant, or derivative recombination sites for use in the invention include those of the GATEWAY® Cloning Technology and Multi-Site GATEWAY® Cloning Technology, which are available from Life Technologies Corp. (Carlsbad, Calif.).

Att site based recombination systems that may be used in conjunction with the present invention include those which rely on the following principles of operation. In the presence of a mixture of specific recombination proteins, attB sites will recombine with attP sites, resulting in the generation of attL sites and attR sites. The reverse reaction may also occur in the presence of another mixture of specific recombination proteins. Further, att sites have been designed and may further be designed which have particular recombination specificities.

Representative examples of recombination sites which can be used in the practice of the invention include att sites referred to above. Att sites which specifically recombine with other att sites can be constructed by altering nucleotides in and near the 7 base pair overlap region. Thus, recombination sites suitable for use in the methods, compositions, and vectors of the invention include, but are not limited to, those with insertions, deletions or substitutions of one, two, three, four, or more nucleotide bases within the 15 base pair core region (GCTTTTTTATACTAA (SEQ ID NO:70)), which is identical in all four wild-type lambda att sites: attB, attP, attL and attR. Recombination sites suitable for use in the methods, compositions, and vectors of the invention also include those with insertions, deletions or substitutions of one, two, three, four, or more nucleotide bases within the 15 base pair core region referred to above and those which are at least 50% identical, at least 55% identical, at least 60% identical, at least 65% identical, at least 70% identical, at least 75% identical, at least 80% identical, at least 85% identical, at least 90% identical, or at least 95% identical to this 15 base pair core region.

The region defined by the sequence TTTATAC in the 15 base pair core region is referred to as the seven base pair overlap region. The seven base pair overlap region is the cut site for the integrase protein and is the region where strand exchange takes place.
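For illustration, the position of the seven base pair overlap within the 15 base pair core region, and a percent identity figure of the kind recited above, can be computed as follows. This is a minimal sketch that assumes an ungapped, position-by-position comparison of equal-length sequences, so it covers substitutions but not insertions or deletions; the function name and the example variant are hypothetical.

```python
CORE = "GCTTTTTTATACTAA"   # 15 base pair core region (SEQ ID NO:70)
OVERLAP = "TTTATAC"        # seven base pair overlap region


def percent_identity(candidate, reference=CORE):
    """Ungapped identity between two equal-length sequences, in percent."""
    if len(candidate) != len(reference):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(candidate, reference))
    return 100.0 * matches / len(reference)


start = CORE.find(OVERLAP)  # 0-based offset of the overlap inside the core
variant = CORE[:start] + "AAAATAC" + CORE[start + len(OVERLAP):]
print(start, variant, percent_identity(variant))
```

In this example the overlap begins at offset 5 of the core region, and replacing its first three positions leaves 12 of 15 positions unchanged, i.e., 80% identity.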

Altered att sites have been constructed which demonstrate that (1) substitutions made within the first three positions of the seven base pair overlap (the TTT of TTTATAC) strongly affect the specificity of recombination, (2) substitutions made in the last four positions (the ATAC of TTTATAC) only partially alter recombination specificity, and (3) nucleotide substitutions outside of the seven bp overlap, but elsewhere within the 15 base pair core region, do not affect specificity of recombination but do influence the efficiency of recombination. Thus, nucleic acid molecules and methods of the invention include those which comprise or employ one, two, three, four, five, six, eight, ten, or more recombination sites which affect recombination specificity, particularly one or more (e.g., one, two, three, four, five, six, eight, ten, twenty, thirty, forty, fifty, etc.) different recombination sites that may correspond substantially to the seven base pair overlap within the 15 base pair core region, having one or more mutations that affect recombination specificity. Particularly, such molecules may comprise a consensus sequence such as NNNATAC, wherein “N” refers to any nucleotide (i.e., may be A, G, T/U or C), as well as modified and non-standard nucleotides such as inosine. In some instances, if one of the first three nucleotides in the consensus sequence is a T/U, then at least one of the other two of the first three nucleotides is not a T/U. Exemplary seven base pair att site overlap regions suitable for use with the invention are set out in TABLE 16.

TABLE 16
AAAATAC   CAAATAC   GAAATAC   TAAATAC
AACATAC   CACATAC   GACATAC   TACATAC
AAGATAC   CAGATAC   GAGATAC   TAGATAC
AATATAC   CATATAC   GATATAC   TATATAC
ACAATAC   CCAATAC   GCAATAC   TCAATAC
ACCATAC   CCCATAC   GCCATAC   TCCATAC
ACGATAC   CCGATAC   GCGATAC   TCGATAC
ACTATAC   CCTATAC   GCTATAC   TCTATAC
AGAATAC   CGAATAC   GGAATAC   TGAATAC
AGCATAC   CGCATAC   GGCATAC   TGCATAC
AGGATAC   CGGATAC   GGGATAC   TGGATAC
AGTATAC   CGTATAC   GGTATAC   TGTATAC
ATAATAC   CTAATAC   GTAATAC   TTAATAC
ATCATAC   CTCATAC   GTCATAC   TTCATAC
ATGATAC   CTGATAC   GTGATAC   TTGATAC
ATTATAC   CTTATAC   GTTATAC   TTTATAC
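The entries of TABLE 16 are the 64 sequences matching the consensus NNNATAC when N is restricted to the four standard bases. A minimal sketch that enumerates them, and that applies the optional condition that the first three positions are not all T (which excludes only the wild-type TTTATAC), is shown below. Variable names are hypothetical, and the enumeration order differs from the column layout of the table.

```python
from itertools import product

# All NNNATAC overlaps with N drawn from the four standard bases.
overlaps = ["".join(first_three) + "ATAC" for first_three in product("ACGT", repeat=3)]

# Optional condition: if one of the first three bases is T,
# at least one of the other two is not T (i.e., the triplet is not TTT).
altered = [s for s in overlaps if s[:3] != "TTT"]

print(len(overlaps), len(altered))  # 64 63
```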

Type IIS Topoisomerase Assembly Toolkit. Based on type IIS restriction-mediated shuffling and topoisomerase mediated cloning the inventors have further developed a high throughput friendly DNA assembly kit suitable for assembling complex DNA binding effector molecules which provides a commercial solution, rich in consumables, for DNA shuffling cloning. In some aspects, the kit allows for rapid generation of intermediate cloning vectors that can be combined to generate the final full-length construct. One possible workflow of performing the invention is summarized in FIG. 11A of this document. In this example, a series of topoisomerase adapted donor vectors with symmetrical ends are built that differ only in the kind of type IIS restriction sites present at their ends. A design software tool selects a donor vector that is compatible with the nucleic acid sequence to be assembled. In other words, the corresponding type IIS restriction sites must be absent in the input sequence. The design software also generates the subsequences or sequence fragments that will appear in each entry clone (i.e., a donor vector with insert) based on compatibility of the adjacent ends. Thus, in some embodiments, the choice of topoisomerase adapted vector and the fragmentation of a given full-length sequence can be determined by a gene assembly algorithm that analyzes the input sequence, identifies non-cutter type IIS restriction enzymes, and recommends a subcloning strategy. One advantage of the invention is the symmetrical nature of the donor vector which does not impose limitations on the direction of the cloned sequence, virtually eliminating screening requirements. In an analogy with the MultiSite GATEWAY® technology (see, e.g., U.S. Pat. No. 8,030,066, the disclosure of which is incorporated herein by reference), the resulting entry clones carrying subfragments in unspecific orientation are combined with a target or destination vector (e.g., a functional vector) harboring compatible ends (e.g., flanking a counter selectable marker in the parental plasmid). An enzymatic mix composed of at least the corresponding type IIS restriction enzyme plus ligase is added, and after a brief incubation period an aliquot is transformed into competent cells. The resulting clone may, for example, be an expression vector.
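The identification of non-cutter type IIS enzymes by the design software can be sketched as a search for recognition sites on both strands of the input sequence. The following is a minimal illustration and not the algorithm of the invention: the function names and the example input are hypothetical, and the recognition sequences listed are the commonly cited ones for the enzymes named elsewhere herein, which should be verified against the supplier's data before use.

```python
# Commonly cited recognition sequences (assumption; verify before use).
TYPE_IIS_SITES = {
    "BsaI": "GGTCTC",
    "BsmBI": "CGTCTC",
    "BbsI": "GAAGAC",
    "AarI": "CACCTGC",
    "SapI": "GCTCTTC",
    "BtgZI": "GCGATG",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def non_cutters(sequence, sites=TYPE_IIS_SITES):
    """Enzymes whose recognition site is absent from both strands of the input."""
    seq = sequence.upper()
    return sorted(
        name
        for name, site in sites.items()
        if site not in seq and reverse_complement(site) not in seq
    )


# Hypothetical input containing a BsaI site (GGTCTC) and a BbsI site (GAAGAC).
print(non_cutters("ATGGGTCTCAAGAAGACTT"))  # ['AarI', 'BsmBI', 'BtgZI', 'SapI']
```

A donor vector flanked by sites for any of the returned enzymes would then be compatible with the input sequence in the sense described above.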

Thus, in some embodiments, the invention comprises a product composed at least of: a web tool that (i) is capable of splitting a given wild-type sequence into smaller parts, (ii) develops an assembly strategy for the entry and expression clones based on cleavage sites absent in the wild-type sequence, (iii) designs the required oligonucleotides and (iv) indicates what kit should be used; and a series of kits (A, B, C, . . . N), each composed at least of (i) a linear topoisomerase-adapted (donor) vector, (ii) an enzyme mix comprising at least a DNA ligase and a type IIS cleavage enzyme, (iii) a (linearized) destination vector, and (iv) competent cells.

In another aspect, a customized kit differs from the series of different kits in the kind of type IIS restriction enzyme cleavage sites of the donor and destination vectors and the kind of type IIS restriction enzyme present in the enzyme mixture.

In one embodiment, the invention relates to a customized assembly kit as described above, wherein the destination vector is a functional TAL effector vector.

Topoisomerase-based assembly kits can be used by customers to assemble DNA subfragments that have been obtained, e.g., by PCR amplification, restriction digest or other methods known in the art (FIG. 11B, left chart). Thus, the invention relates, in part, to a first workflow for gene assembly including some or all of the following steps: (i) obtaining an input sequence from a customer and obtaining vector sequence information and restriction enzyme information from at least one database; (ii) analyzing the input sequence at least for absent cleavage sites, generating subfragments of the sequence by computer-aided means, generating an assembly strategy by computer-aided means and selecting vector and enzyme combinations from at least one database; (iii) composing a customized toolkit comprising at least (a) one or more topoisomerase-adapted donor vectors with flanking type IIS recognition sites, (b) a destination vector with a selectable marker sequence, (c) an enzyme mix containing at least one type IIS enzyme and a ligase and (d) competent cells; and (iv) shipping the customized toolkit to the customer for assembly.

In cases where no DNA template is available or a customer requests an optimized or modified sequence, a gene synthesis service provider can integrate the vectors and assembly methods illustrated in FIG. 11A into the internal manufacturing process to assemble de novo synthesized and optionally optimized and/or modified genes, which are then cloned, purified and subjected to a quality control (QC) process before the final synthetic gene is delivered to a customer. Thus, the invention relates, in part, to a second workflow for gene assembly including some or all of the following steps: (i) obtaining an input sequence from a customer and, optionally, a request for sequence optimization, and obtaining vector sequence information and restriction enzyme information from at least one database; (ii) analyzing the input sequence at least for absent cleavage sites, optionally optimizing the input sequence by computer-aided means, generating subfragments of the sequence by computer-aided means, generating an assembly strategy by computer-aided means and selecting vector and enzyme combinations from at least one database; (iii) synthesizing the subfragments from overlapping oligonucleotides; (iv) cloning the subfragments into one or more topoisomerase-adapted donor vectors with flanking type IIS recognition sites; (v) assembling the subfragments simultaneously into at least one destination vector; (vi) transforming competent cells; and (vii) delivering the purified and QC-analysed synthetic gene to the customer.

The invention also relates to embodiments of either of the above workflows wherein the destination vector is a functional vector containing at least the N- and C-terminal flanking sequences of a TAL effector and, optionally, an effector fusion sequence, and wherein the subfragments are TAL nucleic acid binding cassettes and/or TAL repeats.

Universal TAL Assembly Kit. Apart from customized vector services or web-aided toolkits as described above, a universal TAL assembly kit may be an interesting alternative for customers who prefer doing most of the work themselves using standardized parts. In such instances, all components required for TAL effector assembly would be delivered in kit format by a service provider and assembly performed by the customer according to the provided protocol. A universal TAL assembly kit can be used for the assembly of various TAL effector fusions with any desired binding specificity. As described elsewhere herein, a TAL effector may contain a variable number of repeats (typically between 1.5 and 33.5). The advantage of a smaller number of repeats is the reduced complexity of assembly steps, whereas a larger number of repeats may be more reliable in terms of binding efficiency and specificity. The number of repeats to be assembled by means of a universal assembly kit should therefore be in a reasonable range, resulting in reliable binding without making correct assembly an experimental challenge. A smaller number of repeats (such as, e.g., 6, 8 or 10 repeats) may be assembled by a two-step assembly method according to one of the protocols described herein using monomeric cassette building blocks, i.e., one cassette per building block. For example, three cassettes may be assembled into each of two capture vectors and the two resulting trimers subsequently combined into a functional vector. Likewise, five cassettes may be assembled into each capture vector and the resulting 5-mers combined into a 10-mer repeat in the functional vector. In cases where larger arrays are to be assembled, a repeat library containing pre-synthesized combinations of two or more cassettes may be useful to limit the number of fragments to be assembled per step and the number of parallel assembly reactions.

A universal assembly kit of the invention may therefore contain a ready-to-use TAL cassette library, i.e., a collection of building blocks containing a specific combination of two, three or four or even more TAL cassettes. One embodiment of the invention described herein provides for trimer or tetramer libraries to assemble arrays of about 17.5 or 23.5 or even more repeats. Whereas trimer or tetramer libraries may be preferred in high-throughput assembly settings as described above, a library containing fewer building blocks may be a better starting point for a universal and well-arranged assembly kit. To provide a complete collection of triplet combinations of binding cassettes representing all possible positions within a 17.5 or a 23.5 repeat containing TAL effector, a trimer library would require a huge number of individual constructs (512 clones in Example 3), which may be difficult to store and/or handle in a kit system. In contrast, using a library based on TAL repeat dimers would reduce the number of required building blocks per kit without limiting the possibility to assemble all combinations of TAL nucleic acid binding cassettes. Whereas such a dimer library can be provided to the customer in a well-arranged format, the kit provider benefits from less manufacturing work or reproduction of fewer components per kit, which makes the kit more cost-efficient.

A dimer library of the invention contains at least four different categories of cassettes, each of which allows for specific binding of a base via its defined RVD. For example, the cassettes of the kit may contain the following RVDs: NI for A, NK for G, HD for C and NG for T. As described elsewhere herein, alternative RVDs may be chosen as some bases are bound by different RVDs whereas some RVDs bind different bases (e.g., methylated versus non-methylated cytosine). The combination of each of the cassettes into pairs results in 16 distinct combinations (NI-NI, NI-NK, NI-HD, NI-NG, NK-NK, NK-NI, NK-HD, NK-NG, HD-HD, HD-NI, HD-NK, HD-NG, NG-NG, NG-NI, NG-NK, NG-HD). To allow for directed assembly into each possible position within a given repeat array, each pair is flanked by a 5′ region containing at least a first type IIS restriction enzyme cleavage site and by a 3′ region containing at least a second type IIS restriction enzyme cleavage site, which generate unique protruding ends after cleavage. In one embodiment the 5′ and 3′ regions of each cassette pair have identical cleavage recognition sites but produce different single-stranded overhangs upon cleavage to allow for directional assembly. Typically, the cassette pairs with cleavable 5′ and 3′ regions are inserted into plasmids to be stored as individual dimer building blocks. A selected set of building blocks can then be assembled into a capture vector by simultaneous cleavage according to the “Golden Gate” cloning strategy as described elsewhere herein to connect each cassette pair with a compatible overhang of another pair or with a compatible overhang of the respective capture vector.
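The pairing of four RVD categories into ordered dimers can be enumerated with a short Python sketch. The RVD-to-base assignment is the example given above; the function and variable names are assumptions for illustration.

```python
from itertools import product

# Example RVD assignment from the text: NI for A, NK for G, HD for C, NG for T.
RVD_FOR_BASE = {"A": "NI", "G": "NK", "C": "HD", "T": "NG"}


def dimer_combinations(rvds=tuple(RVD_FOR_BASE.values())):
    """Return every ordered pair of RVD categories as 'first-second' labels."""
    return [f"{first}-{second}" for first, second in product(rvds, repeat=2)]


dimers = dimer_combinations()
print(len(dimers))
print(dimers)
```

Because the order of the two cassettes matters (NI-NK targets AG, NK-NI targets GA), four categories give 4 × 4 = 16 ordered pairs.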

FIG. 32A shows an example of how a collection of dimer building blocks can be organized to allow for the assembly of a 16-repeat TAL array. In this example, a library of 96 TAL repeat dimer blocks, i.e., a collection of 96 purified plasmids each containing a pair of TAL binding cassettes, is provided for two-step assembly of a TAL effector fusion. In a first step, 4 selected dimers are assembled into each of the two capture vectors via BsaI-mediated cleavage. In the second step, the resulting 8-mers are combined into a functional expression vector encoding a nuclease function via AarI-mediated cleavage. As indicated in FIG. 32A and shown in TABLE 17 below, 6 variants of each of the 16 dimers are sufficient to represent all required compatible ends for co-assembly of four selected cassette pairs into each capture vector to generate a 16-repeat array. The outside dimer building blocks (variants 1, 4, 5 and 6 in the example of FIG. 32A) provide protruding ends that must be compatible with the ends of the first and second capture vectors and can therefore not be recycled at other positions. However, the internal building blocks (variants 2 and 3 in the example of FIG. 32A) can be allocated to positions 2 and 3 of both capture vectors, respectively, thereby reducing the number of variants per dimer to 6 (instead of 8), which results in a total of 6×16=96 required dimer building blocks. The compatibility of the protruding ends of each variant is shown in TABLE 17:

TABLE 17
Letters a to f represent individual overhangs generated on the 5′ or 3′ terminal ends of each building block or capture vector.

Variant 1       Variant 2       Variant 3       Variant 4       Variant 5       Variant 6
5′-a-----b-3′   5′-b-----c-3′   5′-c-----d-3′   5′-d-----e-3′   5′-e-----b-3′   5′-d-----f-3′
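The overhang scheme of TABLE 17 can be checked with a minimal Python sketch. The variant order per capture vector (1-2-3-4 for the first and 5-2-3-6 for the second) follows the description of FIG. 32A; the function names, the dictionary layout and the example target are assumptions for illustration, and the sketch does not model the capture vector ends themselves.

```python
# Overhangs per variant as (5' end, 3' end), taken from TABLE 17.
VARIANT_OVERHANGS = {
    1: ("a", "b"), 2: ("b", "c"), 3: ("c", "d"),
    4: ("d", "e"), 5: ("e", "b"), 6: ("d", "f"),
}
# Variant used at positions 1 to 4 of each capture vector (per FIG. 32A).
CAPTURE_VECTOR_VARIANTS = {"CV1": [1, 2, 3, 4], "CV2": [5, 2, 3, 6]}
# Example RVD assignment from the text.
RVD_FOR_BASE = {"A": "NI", "G": "NK", "C": "HD", "T": "NG"}


def chain_is_compatible(variants):
    """True if each block's 3' overhang equals the next block's 5' overhang."""
    return all(
        VARIANT_OVERHANGS[left][1] == VARIANT_OVERHANGS[right][0]
        for left, right in zip(variants, variants[1:])
    )


def plan_16_repeat_array(target):
    """Map a 16-nt target onto (dimer, variant) picks for CV1 and CV2."""
    target = target.upper()
    if len(target) != 16 or set(target) - set(RVD_FOR_BASE):
        raise ValueError("expected 16 nucleotides drawn from A, C, G, T")
    dimers = [
        f"{RVD_FOR_BASE[target[i]]}-{RVD_FOR_BASE[target[i + 1]]}"
        for i in range(0, 16, 2)
    ]
    plan = {}
    for index, (vector, variants) in enumerate(CAPTURE_VECTOR_VARIANTS.items()):
        picks = dimers[4 * index: 4 * index + 4]
        plan[vector] = list(zip(picks, variants))
    return plan


for vector, variants in CAPTURE_VECTOR_VARIANTS.items():
    print(vector, variants, chain_is_compatible(variants))
print(plan_16_repeat_array("ACGTACGTACGTACGT"))
```

Each (dimer, variant) pick corresponds to one plasmid of the 96-member library, so the plan amounts to a pipetting list for the first assembly step.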

As discussed above, the complexity of each multimer library can be further reduced depending on the number of repeats to be assembled. For example, a kit using a dimer library designed for the assembly of 12 repeats may only require 5×16=80 dimer building blocks if three dimers are assembled into each capture vector. A kit with 5×16=80 dimer building blocks may also be used to co-assemble 5 dimers into a 10-repeat array in only one assembly step. In an alternative embodiment, more repeats can be assembled from a dimer library in the first step if a third capture vector is available. In such case, three dimers would be assembled into each capture vector in the first step and the resulting hexamers would be combined into the functional vector. Many different combinations are feasible. However, it should be taken into account that more parallel reactions may be less user-friendly and the assembly may become more error-prone with an increasing number of fragments to be co-assembled in each step. Thus, for the assembly of larger arrays (e.g., requiring more than three capture vectors in a first assembly step), the use of a trimer or tetramer library as discussed above may be preferred to limit the number of reactions. Also, the assembly of many large fragments should be avoided as the efficiency decreases with fragment length.

The number of variants per dimer building block that are required to assemble a given number of repeats depends on the assembly strategy. To calculate a minimum set of variants, the number of capture vectors in a first assembly step and the number of building blocks co-assembled into each capture vector must be taken into account. The principle of the underlying calculation is demonstrated in TABLE 18 below.

TABLE 18

CAS   BB_(CV1)   BB_(CV2) (CV2_(int))   BB_(CV3) (CV3_(int))   BB_(tot)   n = BB_(tot) − (CV2_(int) + CV3_(int))   n × 16
24    4          4 (2)                  4 (2)                  12         12 − (2 + 2) = 8                         128
22    4          4 (2)                  3 (1)                  11         11 − (2 + 1) = 8                         128
20    4          3 (1)                  3 (1)                  10         10 − (1 + 1) = 8                         128
18    3          3 (1)                  3 (1)                  9          9 − (1 + 1) = 7                          112
16    4          4 (2)                  0                      8          8 − (2 + 0) = 6                          96
14    4          3 (1)                  0                      7          7 − (1 + 0) = 6                          96
12    3          3 (1)                  0                      6          6 − (1 + 0) = 5                          80
10    3          2 (0)                  0                      5          5 − (0 + 0) = 5                          80

The first column CAS indicates the total number of required TAL binding cassettes to be assembled. BB_(CV1), BB_(CV2) and BB_(CV3) indicate how many building blocks are to be co-assembled into each of the three capture vectors, whereas BB_(tot) shows the total number of dimer building blocks required in this first assembly step. The numbers in parentheses (CV2_(int) and CV3_(int)) indicate how many of the building blocks are internal building blocks and can therefore be recycled in the second and (if applicable) the third capture vectors. To calculate the number of variants required per dimer building block, the total number of dimer building blocks can be reduced by the sum of building blocks that can be recycled in CV2 and CV3, which can be expressed by the formula: n=BB_(tot)−(CV2_(int)+CV3_(int)). Examples of how to calculate the number of variants for different combinations are given for repeat arrays containing between 10 and 24 repeats, which reflects a reasonable range that can be covered based on a dimer library in a two-step assembly process. Smaller repeat numbers can be assembled in one step, whereas for larger repeat numbers the above-described trimer or tetramer libraries may be more useful. Also, where half-repeats are to be included or an odd number of repeats is to be assembled, these can, e.g., be provided in the terminal building block or in a functional vector.
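The calculation of n can be reproduced with a minimal Python sketch. The function names are assumptions for illustration, and the rule that a capture vector receiving k building blocks has k − 2 internal blocks (both end positions being outside blocks) is inferred from the values in parentheses in TABLE 18 rather than stated explicitly in the text.

```python
def internal_blocks(blocks_in_vector):
    """Blocks not at either end of a capture vector (inferred as k - 2, min 0)."""
    return max(blocks_in_vector - 2, 0)


def variants_required(bb_cv1, bb_cv2, bb_cv3=0):
    """n = BB_tot - (CV2_int + CV3_int), as given in the text."""
    bb_tot = bb_cv1 + bb_cv2 + bb_cv3
    recycled = internal_blocks(bb_cv2) + internal_blocks(bb_cv3)
    return bb_tot - recycled


def library_size(bb_cv1, bb_cv2, bb_cv3=0, dimers=16):
    """Total number of dimer building blocks in the kit (n x 16)."""
    return variants_required(bb_cv1, bb_cv2, bb_cv3) * dimers


# Rows of TABLE 18 as (BB_CV1, BB_CV2, BB_CV3).
for row in [(4, 4, 4), (4, 4, 3), (4, 3, 3), (3, 3, 3),
            (4, 4, 0), (4, 3, 0), (3, 3, 0), (3, 2, 0)]:
    cassettes = 2 * sum(row)  # each dimer block carries two cassettes
    print(cassettes, row, variants_required(*row), library_size(*row))
```

The loop prints one line per row of TABLE 18, which makes it easy to compare alternative assembly strategies for the same array length.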

Thus, the invention relates, in part, to a collection of n×16 dimer building blocks, each of the 16 dimer building blocks carrying a defined pair of TAL binding cassettes wherein each TAL binding cassette is selected from one of at least four different categories of RVDs, with each RVD binding preferentially to a specific base in a target nucleic acid molecule, with

n=BB_(tot)−(CV2_(int)+CV3_(int)), wherein

BB_(tot) represents the total number of dimer building blocks assembled into CV1, CV2 and optionally CV3; and CV2_(int) and CV3_(int) represent the number of internal dimer building blocks assembled into CV2 and CV3 which do not have a protruding end compatible with one of the protruding ends of CV2 and CV3.

A universal assembly kit providing such a collection of dimer building blocks as described above may further comprise the required number of capture vectors for the envisaged assembly strategy and a functional vector for two-step assembly of a TAL effector fusion. One example of how a universal TAL assembly kit can be presented is shown in FIG. 32B. In this example the dimer library is arranged in a 96-well plate to allow for systematic pipetting of the required building blocks. In addition, the kit may contain two or more restriction enzymes, a ligase and respective buffer compositions required for type IIS-mediated assembly and, optionally, competent bacteria for transformation of assembled vectors. A universal assembly kit according to the embodiments of the invention can be used according to any one of the two-step assembly protocols described herein. The kit may be provided with one or more protocols and specific instructions including troubleshooting for each assembly approach.

A kit according to the embodiments of the invention may be furnished with one or more functional vectors. A functional vector provided in the kit may, e.g., carry a TAL effector fusion encoding a nuclease such as a FokI nuclease. Alternatively, a functional vector may carry an activator, a repressor, an epigenetic modifier or may contain a multiple cloning site for insertion of an effector function provided by the customer. In certain aspects, a functional vector provided with the kit may be an expression vector. The functional vector in the example of FIG. 32B provides an expression cassette under control of a CMV promoter and a polyA site for expression of assembled TAL effector FokI nuclease in mammalian cells. The functional vector in the kit may also be a topoisomerase-adapted vector or a GATEWAY® Entry Clone or any other functional vector described elsewhere herein.

In addition, some or all vectors included in the kit may contain a counter selectable marker gene. Any counter selectable marker gene that allows for selection of correctly assembled capture vectors or TAL effector fusions may be used for that purpose. In one embodiment, the selectable marker gene may be ccdB. In another embodiment, the selectable marker gene may be tse2 or a modified functional version thereof as described elsewhere herein. Vectors of the kit may carry the same or different selectable marker genes or may be furnished with one or more additional selection markers such as, e.g., an antibiotic resistance expression cassette. Providing kit-related vectors with toxic selection markers such as, e.g., tse2 may increase the success rate of correct assembly for customers using such a kit and may also prevent commercial vector systems from being propagated and re-distributed by the customer in the absence of a commercially available antidote system (such as, e.g., a Tsi2 expressing host cell), which is essential for a service provider to protect kit- or vector-associated revenues.

The universal assembly kit may further contain a control vector or a vector expressing a reporter gene which indicates successful assembly. In addition, the kit may be combined with a reporter vector or one of the functional assays of the invention described herein to evaluate TAL effector binding and/or activity of a fused effector function in vitro or in vivo.

Thus, the invention also relates to a TAL assembly kit for type IIS-mediated two-step assembly of a TAL effector characterized by at least a first assembly step in the presence of a first capture vector CV1, a second capture vector CV2 and, optionally, a third capture vector CV3, wherein said kit contains at least:

-   -   (a) a collection of n×16 dimer building blocks, each of the 16 dimer building blocks carrying a defined pair of TAL binding cassettes wherein each TAL binding cassette is selected from one of at least four different categories of RVDs, with each RVD binding preferentially to a specific base in a target nucleic acid molecule, with

n=BB_(tot)−(CV2_(int)+CV3_(int)), wherein

-   -    BB_(tot) represents the total number of dimer building blocks assembled into CV1, CV2 and optionally CV3; and
-   -    CV2_(int) and CV3_(int) represent the number of internal dimer building blocks assembled into CV2 and CV3 which do not have a protruding end compatible with one of the protruding ends of CV2 and CV3;
-   -   (b) at least a first capture vector CV1, a second capture vector CV2 and, optionally, a third capture vector CV3; and
-   -   (c) at least a first functional vector wherein said functional vector may contain one or more additional TAL binding cassettes or half cassettes.

The collection of dimer building blocks in (a) may be provided as circular plasmids either in solution or in lyophilized form. In one embodiment of the invention, the collection of dimer building blocks may be provided in a multi-well plate such as, e.g., a 96-well plate, either separate or as part of the kit. In certain embodiments of the invention, n is a number in the range of 5 to 8. The first, second and third capture vectors in (b) may contain one or more selectable markers. In yet another embodiment, at least one of the one or more selectable markers may be a counter selectable marker such as ccdB or tse2. The selectable marker may be flanked by one or more type IIS restriction enzyme cleavage sites. A second selectable marker may code for an antibiotic resistance.

The functional vector in (c) may encode an effector function such as a nuclease, a repressor, an activator or an epigenetic modifier activity. Alternatively, the functional vector may contain a region for insertion such as a multiple cleavage site for insertion of another fusion moiety.

In addition, a kit according to the invention may contain one or more of the following components:

-   -   (d) a first type IIS restriction enzyme and a buffer composition allowing for cleavage of a nucleic acid molecule containing a recognition site for said first type IIS restriction enzyme;
-   -   (e) a second type IIS restriction enzyme and a buffer composition allowing for cleavage of a nucleic acid molecule containing a recognition site for said second type IIS restriction enzyme;
-   -   (f) a ligase and a buffer composition allowing for ligation of assembled nucleic acid molecules;
-   -   (g) an aliquot of competent bacteria for transformation of assembled vectors such as, e.g., chemically competent or electro-competent E. coli;
-   -   (h) a control vector with a selectable marker gene or a reporter gene for validation of the assembly reaction; and
-   -   (i) any one of the functional binding assays described herein.

In a specific embodiment, the first type IIS restriction enzyme of (d) may be BsaI and the second type IIS restriction enzyme may be AarI. In yet another embodiment, the ligase in (f) may be a T4 ligase. However, any other type IIS restriction enzyme or ligase suitable for using the kit according to one of the protocols of the invention can be included in the kit.

TAL QC and Functional Analyses

TAL sequencing. In another aspect, the invention relates to quality control of assembled TAL effector coding nucleic acid sequences. Due to the highly repetitive nature of TAL effector sequences, quality control of assembled TAL repeats by sequencing from both ends is challenging. For example, if a TAL effector contains 24 cassettes, it will not usually be a problem to sequence the first 10 or more repeats from one end of a vector and the last 10 or more repeats from the other end of the vector by designing specific sequencing primers to bind within the vector backbone and read in opposite directions. To guarantee complete sequencing of the entire 24-repeat domain encoded by approx. 2,450 nucleotides, at least one additional primer would have to be designed to bind to a target sequence located preferably near the center of the plurality of assembled cassettes. This can, however, only be realized if a specific primer binding site can be identified in at least one of the cassettes, which is difficult due to the highly repetitive nucleotide sequence. One aspect of the invention provides a solution to this problem by making use of the degeneracy of the genetic code. By modifying the codon composition within one or more cassettes, specific primer binding sites can be provided without altering the encoded amino acid sequence.

Thus, in one embodiment the library of cassettes for TAL effector assembly contains at least one first cassette per category wherein the codon composition of said first cassette differs from the codon compositions of all other cassettes of the same category and wherein said cassette is allocated to only one distinct position in the series of cassettes and wherein said one distinct position is preferably a position in the center or close to the center of the total number of cassette positions.

In another embodiment, the library of cassettes contains at least one second cassette per category wherein the codon composition of said second cassette differs from the codon composition of the first cassette and from the codon composition of all other cassettes of the same category and wherein said second cassette is allocated to only one distinct position in the series of cassettes and wherein said one distinct position is preferably a position in the center or close to the center of the total number of cassette positions and is different from the position of the first cassette.

To generate cassettes with unique codon composition the codons can, e.g., be altered to use less preferred codons (e.g., the second best or third best codon instead of the best codon) according to a given codon usage table as illustrated by the following example:

A 34-amino acid repeat capable of binding to nucleotide “A” via RVD “NI” has the following amino acid sequence:

(SEQ ID NO: 71) LTPEQVVAIASNIGGKQALETVQRLLPVLCQAHG

The same repeat sequence is encoded by all cassettes of the category “A” (cassette A1, A2, A3, A4, A5 etc.) which have been codon-optimized for expression in human hosts.

A cassette A10 may have the following nucleic acid sequence:

(SEQ ID NO: 72) 5′-CTGACCCCCGAACAGGTGGTGGCCATTGCCAGCAACATCGGCGGCAAGCAGGCCCTGGAAACCGTGCAGAGACTGCTGCCCGTGCTGTGCCAGGCCCATGGC-3′

Another cassette A12 may have the following sequence:

(SEQ ID NO: 73) 5′-TTGACTCCAGAACAGGTGGTGGCTATTGCTTCCAATATTGGGGGGAAACAGGCCCTGGAAACTGTGCAGCGCCTGCTGCCAGTGCTGTGCCAGGCTCACGGA-3′

A comparison of the 34 codons in cassettes A10 and A12 reveals that A10 uses preferred codons (according to a human codon usage table) in 29 of 34 cases and uses less preferred codons in 5 cases, whereas A12 uses preferred codons in only 15 cases and less preferred codons in 19 cases. By using more less-preferred codons in at least one of the cassettes of each category, individual primer binding sites can be generated at desired positions. The following alignment shows the different codon compositions of cassettes A10 (upper sequence) and A12 (lower sequence) and one possible primer binding site highlighted in bold.
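That cassettes A10 and A12 encode the same repeat while differing at the nucleotide level can be verified with a minimal Python sketch. The sequences are SEQ ID NOs: 71 to 73 as given above and the translation uses the standard genetic code; the function names are assumptions for illustration. The sketch does not rank codons as preferred or less preferred, since that would require a codon usage table not reproduced here.

```python
# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

REPEAT_AA = "LTPEQVVAIASNIGGKQALETVQRLLPVLCQAHG"  # SEQ ID NO: 71
CASSETTE_A10 = (  # SEQ ID NO: 72
    "CTGACCCCCGAACAGGTGGTGGCCATTGCCAGCAACATCGGCGGCAAGCAG"
    "GCCCTGGAAACCGTGCAGAGACTGCTGCCCGTGCTGTGCCAGGCCCATGGC"
)
CASSETTE_A12 = (  # SEQ ID NO: 73
    "TTGACTCCAGAACAGGTGGTGGCTATTGCTTCCAATATTGGGGGGAAACAG"
    "GCCCTGGAAACTGTGCAGCGCCTGCTGCCAGTGCTGTGCCAGGCTCACGGA"
)


def codons(seq):
    """Split a coding sequence into consecutive triplets."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def translate(seq):
    """Translate a coding sequence using the standard genetic code."""
    return "".join(CODON_TABLE[codon] for codon in codons(seq))


def differing_codon_positions(seq_a, seq_b):
    """Return 1-based codon positions at which two cassettes differ."""
    pairs = zip(codons(seq_a), codons(seq_b))
    return [pos for pos, (a, b) in enumerate(pairs, start=1) if a != b]


print(translate(CASSETTE_A10) == REPEAT_AA)
print(translate(CASSETTE_A12) == REPEAT_AA)
print(differing_codon_positions(CASSETTE_A10, CASSETTE_A12))
```

Stretches in which several neighbouring codons differ are natural candidates for a cassette-specific primer binding site.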

In yet another embodiment, all cassettes of a category may vary in codon combination, e.g., when different ratios of preferred and non-preferred codons are used for each cassette. Cassettes with unique codon composition may further be incorporated into larger building blocks like trimers or tetramers as disclosed elsewhere herein. This strategy allows for robust sequencing of the center of larger TAL effectors.

TAL Library Screening: The invention also includes methods for generating and screening TAL effector libraries, as well as compositions comprising these libraries, and individual members of these libraries. A description of some embodiments of this aspect of the invention is shown in FIG. 12A. FIG. 12A shows a vector-based approach in which TAL nucleic acid binding cassettes encoded by a vector are separated from the vector backbone and each other by digestion with a restriction enzyme (Esp3I in this instance) and then randomly assembled and introduced into a vector backbone to generate a library of nucleic acid binding cassettes which bind to different nucleic acid sequences. These libraries may contain TAL effectors in association with or without additional activities (e.g., transcriptional activation, nuclease, etc.). Further, libraries may be constructed in a manner in which nucleic acid segments encoding the members of the library may be operably linked to nucleic acid encoding additional activities. One method for doing this is to generate TAL effector libraries and introduce the members of these libraries into vectors so that the TAL effector coding sequences are flanked by recombination sites (e.g., att sites). This allows for the library members to be readily transferred to other nucleic acid molecules (e.g., vectors) where they can become operably linked to nucleic acid encoding different additional activities. One recombinational cloning system described herein which can be used in such processes is the GATEWAY® system. Of course, non-recombinational cloning systems can also be used to operably link TAL effector library members to other nucleic acids, including standard restriction enzyme digestion and ligation methods.

FIG. 12B shows four nucleic acid segments which encode TAL nucleic acid binding repeats that recognize each of the four DNA bases. One method for generating random TAL nucleic acid binding repeats involves starting with a linear vector which contains a coding region which encodes a partial first repeat (up to the Esp3I site). The linkage site for the first part of the TAL nucleic acid binding repeat can be designed so that a seamless connection occurs with the vector portion of the repeat. Linking may be facilitated by any number of means including the use of ligases and/or topoisomerases.

When generating TAL effector libraries, conditions may be adjusted so that the libraries have certain characteristics. For example, the concentration ratio of repeating units to vector may be adjusted so as to arrive at a specified average number of repeats being present in each circularized vector. Of course, other methods may be used to achieve the same goal, including limiting the amount of time that ligation of repeats is allowed to take place and size selection of either TAL nucleic acid binding cassettes or vectors which contain these cassettes. Thus, TAL effector libraries may be generated wherein at least 75% (e.g., at least 80%, 85%, 90%, 95%) of the individual library members comprise from about 5 to about 50, from about 10 to about 40, from about 10 to about 30, from about 10 to about 20, from about 10 to about 15, from about 12 to about 50, from about 12 to about 35, from about 12 to about 25, from about 12 to about 20, from about 15 to about 35, from about 15 to about 30, from about 15 to about 25, from about 15 to about 20, from about 17 to about 32, etc. TAL repeats.

Also, TAL effector libraries may be “biased” to increase the number of individual library members that will have binding specificity for nucleic acids with particular characteristics. As an example, AT/CG ratios vary with organism and regions of genomes within organisms. For example, if a TAL effector is sought which binds a nucleic acid region with a higher AT content than CG content, then the TAL effector library may be designed to reflect this. The invention thus includes TAL effector libraries which have nucleic acid binding biases. In some embodiments, TAL effector libraries have TAL repeats which are designed to contain from about 51% to about 80%, from about 55% to about 80%, from about 60% to about 80%, from about 51% to about 75%, from about 51% to about 70%, from about 51% to about 65%, from about 51% to about 60%, from about 55% to about 70%, from about 55% to about 65%, from about 55% to about 60%, adenine and thymine binding repeats. In other embodiments, TAL effector libraries have TAL repeats which are designed to contain from about 51% to about 80%, from about 55% to about 80%, from about 60% to about 80%, from about 51% to about 75%, from about 51% to about 70%, from about 51% to about 65%, from about 51% to about 60%, from about 55% to about 70%, from about 55% to about 65%, from about 55% to about 60%, cytosine and guanine binding repeats. The invention further includes methods for making such libraries, and compositions employed in such methods.
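The effect of biasing a library toward AT-binding repeats can be illustrated in silico with a minimal Python sketch. This is a simulation only, not a wet-lab protocol; the RVD grouping follows the example assignment given elsewhere herein (NI for A, NG for T, HD for C, NK for G), and the function names, sampling scheme and parameter values are assumptions for illustration.

```python
import random

AT_RVDS = ("NI", "NG")  # RVDs binding A and T in the example assignment
CG_RVDS = ("HD", "NK")  # RVDs binding C and G in the example assignment


def sample_repeat_array(n_repeats, at_fraction, rng):
    """Draw RVDs so each position binds A/T with probability at_fraction."""
    if not 0.0 <= at_fraction <= 1.0:
        raise ValueError("at_fraction must be between 0 and 1")
    array = []
    for _ in range(n_repeats):
        pool = AT_RVDS if rng.random() < at_fraction else CG_RVDS
        array.append(rng.choice(pool))
    return array


def at_content(array):
    """Fraction of repeats in an array that bind A or T."""
    return sum(rvd in AT_RVDS for rvd in array) / len(array)


rng = random.Random(0)  # fixed seed so the simulation is repeatable
library = [sample_repeat_array(18, 0.60, rng) for _ in range(1000)]
mean_at = sum(at_content(member) for member in library) / len(library)
print(round(mean_at, 2))
```

In an actual library the bias would be set by the relative amounts of each cassette type offered for random assembly; the simulation merely shows how the average composition of library members tracks that ratio.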

Screening of TAL effectors for binding activity can be performed by any number of methods. For purposes of illustration, the 5′ region of a gene for which a TAL effector fusion activator is sought may be placed upstream from a reporter gene (e.g., green fluorescent protein, beta-galactosidase, etc.). Library nucleic acid molecules may then be introduced into cells containing this reporter construct. The cells may then be screened to identify those in which the reporter is activated. The invention thus includes methods for identifying TAL effectors with binding specificity for specific nucleotide sequences. Methods of this type have the following advantages: (1) in cases where the TAL effector format is functional within a particular cell type, only a single TAL effector library need be constructed for that cell type and (2) it may be possible to identify TAL effectors with different “strengths” of binding for the nucleic acid region. This is so because, when a reporter assay is used, signal strength may correlate with binding strength.

TAL effector libraries, as well as other nucleic acid molecules described herein (e.g., nucleic acid encoding TAL effector fusion proteins), may be inserted into any number of vector types, including lentiviral vectors, which allow for the delivery of one gene per cell.

Once introduced into cells, TAL effector libraries may be phenotypically screened by selection, cell sorting or reporter assay, etc. Further, TAL effector library members may be “rescued” from cells by PCR. The targeted DNA sequence can be identified by sequencing the rescued TAL repeats and may be used, for example, to guide a BLAST search against genomic databases to identify potential candidate targets. TAL effector libraries can be used for cell-based phenotypic screening in a wide variety of areas, such as neurodegeneration, infectious disease, cancer, and stem cells. Phenotypic screening using randomized TAL effector libraries may be used to identify novel functional genes or new therapeutic targets.
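Deriving the targeted DNA sequence from sequenced, rescued repeats can be sketched in Python as follows. The sketch assumes that the repeats have already been translated into amino acid sequences, reads the RVD from positions 12 and 13 of each repeat as described in the Background, and uses the example RVD assignment given elsewhere herein; the function names and the use of “N” for unrecognized RVDs are assumptions for illustration.

```python
# Example RVD assignment from the text; other RVDs are reported as "N".
BASE_FOR_RVD = {"NI": "A", "NK": "G", "HD": "C", "NG": "T"}


def rvd_of_repeat(repeat_aa):
    """Return the repeat variable diresidue at positions 12 and 13 (1-based)."""
    if len(repeat_aa) < 13:
        raise ValueError("repeat too short to contain positions 12 and 13")
    return repeat_aa[11:13]


def target_from_rvds(rvds):
    """Translate an ordered list of RVDs into the predicted target sequence."""
    return "".join(BASE_FOR_RVD.get(rvd, "N") for rvd in rvds)


def target_from_repeats(repeats_aa):
    """Predict the target sequence from an ordered list of repeat sequences."""
    return target_from_rvds([rvd_of_repeat(repeat) for repeat in repeats_aa])


# SEQ ID NO: 71 (RVD "NI") and a hypothetical variant carrying "HD" instead.
repeat_a = "LTPEQVVAIASNIGGKQALETVQRLLPVLCQAHG"
repeat_c = repeat_a[:11] + "HD" + repeat_a[13:]
print(target_from_repeats([repeat_a, repeat_c, repeat_a]))
```

The predicted sequence could then serve as the query for a database search of the kind mentioned above.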

Assay systems for evolved TAL effectors. The invention further includes assay systems and their use for functional evaluation of engineered TAL effector molecules that have been derived by the above-described evolution approaches.

Assays suitable to evaluate the function of TAL effector binding and/or activity of TAL effector fusions in different hosts are described in FIGS. 13-18 and 24. In these examples, reporter systems and minimal genetic circuits were developed to analyse TAL effector function in E. coli (FIG. 13 and Example 4), in algae (FIG. 14 and Example 5) or in mammalian cell culture (FIGS. 15-18 and Examples 6, 7 and 8). Thus, the invention also relates to genetically engineered TAL responsive cell lines. Furthermore, these examples demonstrate functionality of TAL effector activators (FIGS. 15 and 18A and FIG. 33B), TAL effector repressors (FIGS. 15 and 16, and FIG. 33C) and TAL effector nucleases in vivo (FIGS. 17 and 18B) and in vitro (FIG. 19).

The invention also includes in vitro nucleic acid cleavage assays for measuring TAL effector binding activity, which may be used, for example, with TAL effector libraries. One exemplary work flow for such an assay is shown in FIG. 24. In this work flow, FokI TAL effector fusions are prepared by in vitro transcription and translation. These TAL effector fusions are then contacted with nucleic acid containing TAL effector binding sites positioned such that binding of the TAL effector fusions results in FokI endonuclease activity. The amount of cleavage product generated is then measured. In the work flow shown in FIG. 24, cleavage product generation is measured by gel electrophoresis.

The invention thus includes in vitro nucleic acid activity (e.g., nucleic acid cleavage, transcriptional activation, methylation, demethylation, etc.) assays for measuring TAL effector binding activity which involve (1) contacting one or more (e.g., one or two) TAL effectors with nucleic acid containing one or more (e.g., one or two) TAL effector binding sites and (2) measuring TAL effector binding activity. In many instances, TAL effectors used in such assays will be TAL effector fusions and an activity associated with these fusions will be measured. As an example, TAL effector fusions which contain a transcriptional activation domain may be contacted with nucleic acid containing a TAL effector binding site and, optionally, a promoter under conditions where TAL effector fusion binding results in the activation of transcription. In such instances, TAL effector binding may be measured by measuring the amount of transcription product produced. Other in vitro assays may also be employed which make use of the ability of TAL effector binding to block, for example, a restriction site, a transcriptional activation site, a methylation site, or a demethylation site.

In many in vitro assays for TAL effector binding, the affinity of the TAL effector for a particular nucleic acid may be measured. The invention thus includes methods and compositions for comparing the binding affinity of two or more TAL effectors (e.g., a test TAL effector and a control TAL effector). With reference to the work flow shown in FIG. 24, assays will often be “graded” in nature. By this is meant that activity levels may be scored as effectively none, high or somewhere in between. For example, lanes 1 and 3 in FIG. 24 show nearly complete nucleic acid cleavage and lanes 5, 7 and 9 show differing levels of cleavage. Thus, the invention includes assays where a high-activity TAL effector control is used and the activity associated with other TAL effectors (test TAL effectors) is measured and compared to the control. In many instances, test TAL effectors may have activities, for example, between 0 and 100%, 10 and 60%, 10 and 100%, 60 and 100%, 80 and 100%, 50 and 90%, 40 and 80%, etc. of the activity of the control TAL effector.

In other variations of the invention, a control TAL effector is used which has lower activity than at least some of the test TAL effectors. In such embodiments, the control TAL effector may represent an expected mid-level activity and test TAL effectors have activities which may vary above and below the activity of the control TAL effector. Using a control TAL effector activity adjusted to 100%, test TAL effectors may have activities which vary, for example, between 0 and 200%, 10 and 200%, 40 and 150%, 50 and 150%, 30 and 180%, 20 and 180%, etc. of the control TAL effector.
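By way of a non-limiting sketch, the scoring of a test TAL effector relative to a control adjusted to 100% may be computed as shown below. The sketch assumes that cleavage is quantified from band intensities of a gel lane (uncut substrate plus cleavage products); the function names and the intensity values are hypothetical and serve only to illustrate the arithmetic.

```python
def cleaved_fraction(uncut, cut_bands):
    """Fraction of substrate cleaved in one lane, from band intensities."""
    cut = sum(cut_bands)
    total = uncut + cut
    if total <= 0:
        raise ValueError("lane contains no signal")
    return cut / total


def relative_activity(test_fraction, control_fraction):
    """Activity of a test TAL effector as a percentage of the control."""
    if control_fraction <= 0:
        raise ValueError("control shows no activity; cannot normalize")
    return 100.0 * test_fraction / control_fraction


# Hypothetical band intensities (arbitrary units).
control = cleaved_fraction(uncut=20, cut_bands=[40, 40])
test = cleaved_fraction(uncut=60, cut_bands=[20, 20])
print(f"control cleaved fraction: {control:.2f}")
print(f"test activity: {relative_activity(test, control):.0f}% of control")
```

Because the result is expressed relative to the control, the same calculation applies whether the control is a high-activity effector (test values up to 100%) or a mid-level effector (test values above and below 100%).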

In one aspect, the invention relates to an assay for screening a library of TAL effector variants in E. coli. The library would be expressed in the presence of a second plasmid carrying an inducible marker gene and a TAL binding site. The marker gene can be a toxic gene, such as, e.g., ccdB or tse2, resulting in cell death upon successful expression. Expression of the marker gene can be induced, for example, by a temperature shift or by an inducible operon system known in the art such as arabinose, galactose, lactose or the like.

In instances where the TAL effector has, e.g., nuclease activity, the assay can be set up to analyse two different TAL effector functions. In a first embodiment, the assay is designed such that the results serve to evaluate whether a modified TAL effector is capable of binding a given target sequence included in the second plasmid. In this instance, a functional nuclease reporter domain would be fused to the modified TAL effector library and selection would identify those TAL effector nucleases with binding specificity for the given target sequence.

In a second embodiment, the assay may be designed such that the results serve to evaluate whether a modified nuclease domain is capable of cleaving a target sequence in the second plasmid to interfere with toxic gene expression. In this instance, a modified nuclease or nuclease domain library may be fused to a functional TAL repeat reporter domain and selection would identify those TAL effector nucleases with functional nuclease domains. In both instances, functional fusion proteins would be characterized by the TAL effector binding to the target site in the second plasmid and the nuclease domain cleaving and inactivating the toxic gene, which results in survival of only those cells carrying a binding-site specific, active TAL effector nuclease.

In a further aspect of the invention, the assay system can also be modified to allow for evaluation of TAL effector activity wherein the effector is a repressor such as, e.g., a lacI repressor binding to a lac operon that controls expression of the selection marker gene. In yet another aspect of the invention, the assay system can be modified to allow for evaluation of TAL effector activity wherein the effector is an activator such that the activation of another factor, e.g., neutralizes the toxic activity of the selection marker. One example of carrying out the invention would be a CcdA expressing cell wherein CcdA expression itself is regulated by the activity of the TAL effector, e.g., a TAL activator protein.

Thus, the invention refers to an assay system allowing for evaluation of modified TAL effector activity wherein either a modified TAL effector is combined with a functional reporter fusion or a functional reporter fusion is combined with a modified TAL effector and the TAL effector variant or a library of TAL effector variants is expressed in a host organism in the presence of a reporter system comprising one or more TAL binding sites and a selectable marker gene, wherein the expression of the selectable marker gene is regulated by the combined activity of the TAL effector and a functional effector fusion.

The assay may, e.g., be performed in a prokaryotic host such as E. coli. In some instances, the effector fusion has nuclease, activator or repressor activity. In one embodiment, the selection marker is a toxic gene such as, e.g., ccdB or tse2. In some embodiments, the selection marker may be under control of an operon such as a lac operon and the expression of the selection marker may be repressed by an operon-specific repressor such as lacI. In a specific embodiment, the host cell may be a CcdA expressing cell and CcdA expression may be regulated by the activity of the tested TAL effector or TAL effector fusion.

Assays for genomic locus modification and off-target detection. TAL effector nucleases as described above can be used to edit genomes by inducing double-strand breaks (DSB), to which cells respond with repair mechanisms. Non-homologous end joining (NHEJ) reconnects DNA from either side of a double-strand break where there is very little or no sequence overlap for annealing. This repair mechanism induces errors in the genome via insertion, deletion, or chromosomal rearrangement; any such errors may render the gene products coded at that location non-functional. Because this activity can vary depending on the species, cell type, target gene, and nuclease used, it should be monitored when designing new systems. In addition to detection of activity at specific target loci, it is, and will increasingly become, important to understand off-target activity of TAL effector nucleases. The invention provides solutions for this problem as described by the following approaches.

Mismatch-detecting enzyme cleavage assay. To detect any difference between two alleles, a simple heteroduplex cleavage assay can be performed. A first aspect of the invention takes advantage of mismatch-detecting enzymes, such as a mismatch-detecting enzyme derived from Perkinsus marinus nuclease PA3 (PM PA3) (see, e.g., GenBank Accession Nos. XP_002788902, XP_002788899, and XP_002782582) and Cel1, Res1 or similar, to identify modifications in the genome. Thus, in one aspect the invention relates to a method to detect genomic locus modification wherein the method is characterized by the steps illustrated in FIG. 20. A detailed description of this assay is given in Example 9a.

A mismatch endonuclease is an endonuclease that recognizes mismatches within double-stranded DNA, including mispairing and unpaired mismatches, and cleaves the DNA (cuts both strands of the double-stranded DNA) at the site of the mismatch in order to excise the mismatch from the DNA. Depending on the mismatch endonuclease used, the endonuclease will cut the DNA either 5′ or 3′ to the mismatch. Apart from the above-described enzymes, phage T4 endonuclease VII and T7 endonuclease I have been shown to bind to DNA mismatches and can therefore be used to efficiently detect genomic lesions caused by TAL nuclease cleavage. Both enzymes have similar properties (Babon et al. The use of resolvases T4 endonuclease VII and T7 endonuclease I in mutation detection. Mol. Biotechnol. 23:73-81. (2003)) and are capable of recognizing and cleaving all eight types of single base mismatches (AA, CC, GG, TT, AC, AG, TC and TG) and DNA loop structures resulting from insertions or deletions (indels). Example 9b illustrates an embodiment of a mismatch-detecting enzyme cleavage assay according to the invention, wherein an efficient T7 endonuclease I enzyme mix was used to detect mismatches caused by TAL nuclease cleavage. The enzyme mix contains T7 endonuclease I in combination with a ligase such as, e.g., Taq ligase. The use of a ligase moderates the non-specific nicking activity of T7 endonuclease I by repairing spurious nicks before a double strand break occurs. This has the advantage of allowing higher T7 endonuclease I concentrations and a wider range of input DNA while still ensuring complete specific cutting of all DNA mismatches. In this respect, for example, Taq ligase has also been shown to moderate the non-specific nicking activity of other mismatch endonucleases, including T4 endonuclease VII.
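The count of eight single base mismatch types given above can be reproduced with a short enumeration, offered here only as an illustrative check. The sketch treats a mismatch as an unordered pair of bases (so that TC and CT are the same type) and removes the two Watson-Crick pairs; the function name is hypothetical.

```python
from itertools import combinations_with_replacement

# The two Watson-Crick pairs, stored as unordered sets of bases.
WATSON_CRICK = {frozenset("AT"), frozenset("GC")}


def single_base_mismatches():
    """Return every unordered base pairing that is not a Watson-Crick pair."""
    pairs = combinations_with_replacement("ACGT", 2)
    return ["".join(p) for p in pairs if frozenset(p) not in WATSON_CRICK]


mismatches = single_base_mismatches()
print(len(mismatches), mismatches)
```

Of the ten possible unordered pairings of four bases, two are Watson-Crick pairs, which leaves the eight mismatch types listed in the text.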

Thus, the invention relates, in part, to an enzyme composition for detection of mismatch cleavage containing at least an endonuclease that is capable of recognizing and cleaving a mismatch in a DNA double strand and a ligase which is capable of repairing nicks generated by non-specific activity of the endonuclease. In one embodiment, the DNA ligase is Taq ligase. However, the DNA ligase may be any other ligase that repairs nicks in a single strand of a double-stranded DNA. Suitable DNA ligases include, without limitation, AMPLIGASE™ (Epicentre Biotechnologies, Madison, Wis., USA), a thermostable DNA ligase that is derived from a thermophilic bacterium and catalyzes NAD-dependent ligation of adjacent 3′-hydroxylated and 5′-phosphorylated termini in duplex DNA; 9° N™ ligase (New England Biolabs, Ipswich, Mass., USA), a DNA ligase active at elevated temperatures (45-90° C.) that is isolated from the thermophilic archaeon Thermococcus sp.; T4 DNA ligase; Taq DNA ligase; and E. coli DNA ligase. Apart from a mismatch cleaving endonuclease and a DNA ligase capable of repairing nicks, the composition does not require any further enzymatic activities. Thus, in one embodiment, the composition does not include any further enzymes or enzymatic activities.

To allow for complete cleavage of all mismatch DNA in a sample without leaving nicks due to non-specific endonuclease activity, two ratios are important: (i) the ratio of endonuclease to DNA substrate and (ii) the ratio of endonuclease to ligase. At high endonuclease concentrations, the DNA is rapidly degraded, whereas concentrations that are too low would not allow complete cleavage of mismatch DNA, which would result in an underestimation of TAL nuclease-mediated DNA editing. Most of the above referenced enzymes work over a broad temperature range. For example, T7 endonuclease I and Taq ligase may be used at various temperatures from 30° C. to 60° C. However, the optimal temperature should be adjusted for each individual enzyme combination, as each enzyme has a different activity profile across the temperature range. Also, the concentration of each enzyme must be carefully adjusted. Shorter incubation times may be achieved by increasing the concentrations of the enzymes. A skilled person can readily determine an appropriate amount of a particular endonuclease and DNA ligase required under certain reaction conditions by conducting a time course experiment for various amounts of DNA.

In certain instances, the DNA ligase may be added after the treatment with the mismatch endonuclease is completed. Chemical or heat inactivation of the mismatch endonuclease may be used to ensure the endonuclease reaction is completed, or the buffer containing the mismatch endonuclease may be exchanged, thus removing the mismatch endonuclease from the reaction. In other instances, it may be advantageous to incubate the mismatch-carrying DNA with both enzymes at the same time, allowing the ligase to act for the whole period during which the mismatch endonuclease is acting on the double stranded DNA. For this purpose the inventors have developed a ready to use enzyme composition allowing for time-efficient treatment of DNA with both enzymes. Treatment of mismatch nucleic acid is performed in a suitable reaction buffer that contains any coenzymes or counterions that may be required for optimal endonuclease and DNA ligase activity. Where T7 endonuclease I is used with Taq ligase, a ready to use enzyme composition according to the invention may, e.g., contain the following components: T7 endonuclease I and Taq ligase at a ratio of between 1:1 and 1:6 (e.g., at a ratio of from about 1:1 to about 1:5, from about 1:2 to about 1:5, from about 1:3 to about 1:5, from about 1:3.5 to about 1:5, from about 1:3.5 to about 1:4.5, etc.), in a Tris pH 7.4 buffer system supplemented with KCl, EDTA, glycerol, BSA and Triton X-100. In one specific embodiment, 100 μl of the enzyme composition contains 10 μl of T7 endonuclease I (10 U/μl) and 10 μl of Taq ligase (40 U/μl) (both New England Biolabs, Beverly, Mass.) and 80 μl of an enzyme dilution buffer consisting of 10 mM Tris pH 7.4 at 4° C., 50 mM KCl, 0.1 mM EDTA, 50% glycerol, 200 μg BSA/ml and 0.15% Triton X-100. A detailed description of a mismatch cleavage assay using such enzyme composition is given in Example 9b.
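The arithmetic behind the specific 100 μl embodiment may be illustrated as follows. This is a sketch only; it assumes that the stated endonuclease-to-ligase ratio is expressed in activity units, which the text does not state explicitly, and the function name and dictionary keys are hypothetical.

```python
def enzyme_mix(t7_ul, t7_u_per_ul, ligase_ul, ligase_u_per_ul, buffer_ul):
    """Summarize units, final concentrations and unit ratio of an enzyme mix."""
    total_ul = t7_ul + ligase_ul + buffer_ul
    t7_units = t7_ul * t7_u_per_ul
    ligase_units = ligase_ul * ligase_u_per_ul
    return {
        "total_ul": total_ul,
        "t7_units": t7_units,
        "ligase_units": ligase_units,
        "t7_final_u_per_ul": t7_units / total_ul,
        "ligase_final_u_per_ul": ligase_units / total_ul,
        # Ligase units per unit of T7 endonuclease I (assumed basis of ratio).
        "ligase_per_t7": ligase_units / t7_units,
    }


# Volumes and stock concentrations from the specific embodiment above.
mix = enzyme_mix(t7_ul=10, t7_u_per_ul=10, ligase_ul=10, ligase_u_per_ul=40,
                 buffer_ul=80)
print(mix)
# Under the unit-based assumption the ratio is 1:4, within the 1:1 to 1:6 range.
print("within 1:1 to 1:6:", 1 <= mix["ligase_per_t7"] <= 6)
```

On this reading, the composition holds 100 U of T7 endonuclease I and 400 U of Taq ligase in 100 μl, i.e., 1 U/μl and 4 U/μl, respectively.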

ChIP-seq assays. ChIP (chromatin immunoprecipitation) is an efficient method to selectively enrich for DNA sequences bound by a particular protein in living cells. The ChIP process enriches specific crosslinked DNA-protein complexes using an antibody against a protein of interest. Oligonucleotide adapters are then added to the small stretches of DNA that were bound to the protein of interest to enable massively parallel sequencing (ChIP-seq). After size selection, all the resulting ChIP-DNA fragments are sequenced simultaneously using a genome sequencer. A single sequencing run can scan for genome-wide associations with high resolution, meaning that features can be located precisely on the chromosomes.

The inventors have combined the ChIP-seq assay with the specific binding activity of DNA repair protein 53BP1 to map nucleotide lesions in TAL effector nuclease treated cells. Thus, in one aspect the invention relates to a method for mapping lesions wherein the method is characterized by the following steps: (i) subjecting cells treated with a TAL effector nuclease and untreated cells to chromatin immunoprecipitation with an anti-53BP1 antibody, (ii) crosslinking the complex with the DNA, (iii) shearing the complex, (iv) pulling down the complex with a second antibody, (v) optionally, separating the bound DNA from the antibody complex, (vi) performing a high throughput sequencing reaction, and (vii) comparing the sequence profiles with the predicted target site sequence by computer-aided homology analysis. The last step can help to exclude false results due to naturally occurring, spontaneous double stranded breaks or other DNA damage present in the genome, which recruit repair proteins and would otherwise be scored as lesions in this assay. Thus, the invention provides methods for assessing whether nucleotide sequence discrepancies are present in TAL effector coding sequences.
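The final comparison step may be sketched, in a hedged and simplified form, as shown below. The sketch assumes that enriched regions ("peaks") have already been called for treated and untreated cells and are available as short sequences keyed by an identifier; the peak identifiers, the mismatch threshold and the labels are hypothetical, and an ungapped mismatch count stands in for whatever homology analysis is actually used.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def min_mismatches(region, site):
    """Fewest mismatches between the site (either strand) and any window."""
    n = len(site)
    if len(region) < n:
        return None
    best = n
    for query in (site, revcomp(site)):
        for i in range(len(region) - n + 1):
            window = region[i:i + n]
            best = min(best, sum(a != b for a, b in zip(window, query)))
    return best


def classify_peaks(treated, untreated_ids, site, max_mismatches=3):
    """Label treated-only peaks by similarity to the predicted target site."""
    labels = {}
    for peak_id, seq in treated.items():
        if peak_id in untreated_ids:
            continue  # also enriched without nuclease: background damage
        mm = min_mismatches(seq, site)
        similar = mm is not None and mm <= max_mismatches
        labels[peak_id] = "candidate" if similar else "unexplained"
    return labels


site = "TTATCTCACAGGTAAAACT"  # AAVS1 forward site from TABLE 19
treated = {
    "peak_a": "GG" + site + "CC",          # hypothetical on-target peak
    "peak_b": "ACGTACGTACGTACGTACGTACGT",  # hypothetical unrelated peak
    "peak_c": "GGG" + site,                # hypothetical peak seen in both
}
print(classify_peaks(treated, untreated_ids={"peak_c"}, site=site))
```

Peaks labeled "unexplained" show no similarity to the predicted site and are therefore more likely to reflect spontaneous damage than nuclease activity.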

Site-Specific Integration. One application of the invention relates to the integration of desired nucleic acid segments or regions into cellular nucleic acid molecules (e.g., intracellular plasmids, chromosomes, plastid genomes, etc.). Nucleic acid integration may be site specific or random.

Site specific integration methods will typically involve the following: (1) the selection of a target site, (2) the design and/or production of a TAL effector fusion which interacts at or near the target site, and (3) a desired nucleic acid segment or region for integration into the target site.

Any number of criteria may be used for target site selection. As examples, the target site may be (1) known in the particular cell to be a region of open chromatin structure or (2) directly associated with cellular nucleic acid (e.g., a promoter and/or an enhancer) known to confer a particular function (e.g., transcriptional activation) upon nucleic acid at the integration site. Target site selection will vary with the particular cell, the specific application, information available about known potential integration sites, the desire to either disrupt or not disrupt cellular nucleic acid which confers upon the cell particular functional activities, and the nucleic acid segment or region for which integration is sought.

In some instances, it may be desirable to integrate nucleic acid at a location in cellular nucleic acid which is either known to not have open chromatin structure or where the chromatin structure is not known. One example of such a situation is where it is desirable to insert the same nucleic acid segment or region into the same location in cells of different types (e.g., cells of different tissues from the same plant or animal). In such instances, it may be desirable to employ an agent designed to alter chromatin regions. One example of a chromatin remodeling composition is a TAL effector fused to a chromatin remodeling complex protein.

A number of chromatin remodeling complexes are known. Chromatin remodeling complexes generally contain an enzymatic component, which is often an ATPase, a histone acetyl transferase or a histone deacetylase. ATPase components include, but are not limited to, the following polypeptides: SWI2/SNF2, Mi-2, ISWI, BRM, BRG/BAF, Chd-1, Chd-2, Chd-3, Chd-4 and Mot-1. Additional non-enzymatic components, involved in positioning the enzymatic component with respect to its substrate and/or for interaction with other proteins, are also present in chromatin remodeling complexes and can be used as a portion of a fusion molecule.

Modification of chromatin structure will facilitate many processes that require access to cellular DNA. In some embodiments, chromatin modification facilitates modulation of expression of a gene of interest. Modulation of expression comprises activation or repression of a gene of interest. In additional embodiments, chromatin modification facilitates recombination between an exogenous nucleic acid and cellular chromatin. In this way, targeted integration of transgenes is accomplished more efficiently.

Typically, when TAL effector fusions are designed to remodel chromatin, they will have a recognition sequence near the chromatin region for which remodeling is desired. In many instances, the chromatin remodeling TAL effector will bind to cellular nucleic acid within 500 nucleotides (e.g., from about 10 to about 500, from about 30 to about 500, from about 70 to about 500, from about 100 to about 500, from about 150 to about 500, from about 200 to about 500, from about 250 to about 500, from about 300 to about 500, from about 10 to about 400, from about 10 to about 300, from about 10 to about 200, from about 100 to about 200, from about 100 to about 400, etc.) of the target site (e.g., double-stranded break site).

In many instances, methods of the invention will involve the use of a TAL effector fusion which creates a double-stranded break in a cellular nucleic acid molecule. Examples of such TAL effector fusions are provided elsewhere herein and will normally have a nuclease activity.

TAL effector nucleases of the invention allow for efficient site-specific integration of a gene or expression cassette of interest into a selected genetic locus of a cell. In those instances where reliable, predictable and safe expression of an integrated gene is to be achieved, the genetic target locus will often fulfill the following requirements: (i) locus disruption should not induce adverse effects or insertional oncogenesis in the engineered cell or organism and (ii) the locus should allow for active and steady transcription from the inserted gene or expression cassette. Genetic loci fulfilling those requirements across cell types are referred to as “safe harbor loci”. Safe harbor loci are defined as genomic locations that maintain high levels of gene expression and are not appreciably silenced during development. Such loci have been identified in a variety of organisms and can be targeted and used to express heterologous genes in a stable fashion. Heterologous genes inserted into intragenic loci can either be inserted in the absence of a promoter, thus relying on the natural promoter of said locus, or may be inserted in the context of additional components as described below, such as, e.g., a heterologous promoter which may be a constitutive or an inducible promoter as outlined elsewhere herein. In the mouse, a locus known as the Rosa26 locus meets these criteria because it is expressed in embryonic stem cells and many derivative tissues both in vitro and in vivo and genetic cargo can be easily integrated through homologous recombination, which is why it is used as a standard locus for transgenesis in murine embryonic stem cells (Soriano P. Generalized lacZ expression with the ROSA26 Cre reporter strain. Nature Genetics, 21, 70-71 (1999)). Potential safe harbor loci in the human genome include, e.g., the ColA1 locus (Beard et al. Efficient method to generate single-copy transgenic mice by site-specific integration in embryonic stem cells. Genesis, 44(1):23-28 (2006)) and the adeno-associated virus site 1 or so-called AAVS1 locus on chromosome 19, based on the observed repeated integration of wild-type adeno-associated virus into said locus. Integration into this locus disrupts the protein phosphatase 1 regulatory subunit 12C (PPP1R12C) gene, which encodes a protein of as yet unclear function. Genes integrated into AAVS1 have been shown to be reliably transcribed in all primary human cells as well as common transformed cell lines such as HEK293, HeLa or Hep3B cells. Furthermore, embryonic stem cells and induced pluripotent stem cells retained pluripotency when targeted at the AAVS1 locus with Zn-finger nucleases (Hockemeyer et al. Efficient targeting of expressed and silent genes in human ESCs and iPSCs using zinc-finger nucleases. Nature Biotechnology 27, 851-857 (2009)). Other human loci that may qualify as safe harbor integration sites include CCR5, which encodes the major co-receptor of HIV-1 (Lombardo et al. Site-specific integration and tailoring of cassette design for sustainable gene transfer. Nature Methods 8, 861-869 (2011)), and human ROSA26, named after the homologous murine ROSA26 locus (Irion et al. Identification and targeting of the ROSA26 locus in human embryonic stem cells. Nature Biotechnology 25, 1477-1482 (2007)), both of which are located on chromosome 3, the hypoxanthine phosphoribosyltransferase 1 (HPRT) locus on the X chromosome (Sakurai et al. Efficient integration of transgenes into a defined locus in human embryonic stem cells. Nucleic Acids Research 38(7):e96 (2010)) and a locus detected as a hotspot for phiC31 recombinase on chromosome 13 located in an intronic region of the CYLBL gene (Liu et al. Generation of Platform Human Embryonic Stem Cell Lines That Allow Efficient Targeting at a Predetermined Genomic Location. Stem Cells Dev, 18(10), 1459-1472 (2009)). Further loci in the human genome that may be safely targeted by TAL effector nucleases according to methods of the invention include loci 2p16.1 on chromosome 2, 3p12.2 or 3p24.1 on chromosome 3, 6p25.1 or 6p12.2 on chromosome 6, 7q31.2 on chromosome 7, 12q21.2 on chromosome 12, 13q34 on chromosome 13 and 21q21.1 on chromosome 21.

The inventors have chosen some of the characterized human and murine safe harbor loci and have constructed and validated high efficiency TAL effector FokI nuclease pairs specifically targeting those loci. Genomic target sites for some of these TAL nuclease pairs are listed in TABLE 19 below:

TABLE 19
Exemplary target binding sites for TAL nuclease pairs

Locus            Forward TAL nuclease target site             Reverse TAL nuclease target site
AAVS1 (human)    5′-TTATCTCACAGGTAAAACT-3′ (SEQ ID NO: 74)    5′-TCTAGTCCCCAATTTATAT-3′ (SEQ ID NO: 75)
HPRT (human)     5′-TCTAGCCAGAGTCTTGCAT-3′ (SEQ ID NO: 76)    5′-TCAGCCCCAGTCCATTACC-3′ (SEQ ID NO: 77)
CYLBL (human)    5′-TGACTGCAATTTGCATCTT-3′ (SEQ ID NO: 78)    5′-TGAACATGAATCTCAGGGC-3′ (SEQ ID NO: 79)
ROSA26 (mouse)   5′-TCGTGATCTGCAACTCCAG-3′ (SEQ ID NO: 80)    5′-TGCCCAGAAGACTCCCGCC-3′ (SEQ ID NO: 81)

Thus, in one aspect the invention relates to a TAL nuclease targeting a safe harbor locus. In certain embodiments, the safe harbor locus is selected from a mammalian or human safe harbor locus such as, e.g., AAVS1, HPRT, CYLBL or ROSA26 and the genomic target binding sites for the respective TAL nuclease pairs are defined by the forward and reverse target sites listed in TABLE 19.
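As a hedged illustration of how a forward and reverse target site pair from TABLE 19 might be located in a locus sequence, consider the sketch below. It assumes that the reverse site is written 5′ to 3′ on the strand opposite the forward site, so that it appears as its reverse complement on the forward strand; that orientation, the spacer and the flanking bases of the example locus, and the function names are all assumptions made for illustration and are not taken from the table.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_pair(locus, forward_site, reverse_site):
    """Return (forward_start, spacer_length), or None if the pair is absent.

    The reverse site is searched as its reverse complement, downstream of
    the forward site, on the same strand of the locus.
    """
    start = locus.find(forward_site)
    if start < 0:
        return None
    forward_end = start + len(forward_site)
    reverse_start = locus.find(revcomp(reverse_site), forward_end)
    if reverse_start < 0:
        return None
    return start, reverse_start - forward_end


# AAVS1 sites from TABLE 19; the surrounding sequence is invented.
forward = "TTATCTCACAGGTAAAACT"
reverse = "TCTAGTCCCCAATTTATAT"
locus = "GGCC" + forward + "ACGTACGTACGTACG" + revcomp(reverse) + "GGCC"
print(find_pair(locus, forward, reverse))  # (4, 15)
```

The spacer length returned by such a search is the distance between the two binding sites, i.e., the region in which the paired nuclease half-domains would be expected to act.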

In another aspect the invention relates to a kit or vector system allowing for targeted integration of a nucleic acid segment or region into a safe harbor locus of a mammalian or human cell, wherein said kit or vector system may comprise at least the following components:

    (a) a first expression vector carrying a first TAL effector fused to a first nuclease cleavage half-domain wherein the first TAL effector binds a first target site within a safe harbor locus of a mammalian or human cell,
    (b) a second expression vector carrying a second TAL effector fused to a second nuclease cleavage half-domain wherein the second TAL effector binds a second target site within said safe harbor locus of a mammalian or human cell and wherein the second nuclease cleavage half-domain is capable of dimerizing with said first nuclease cleavage half-domain to form a functional dimer, and
    (c) a third vector carrying a nucleic acid segment, gene or expression cassette to be inserted into said safe harbor locus.

Alternatively, vectors (a) and (b) may be replaced by any of the vectors for TAL delivery described below or depicted in FIG. 22 allowing for co-expression of TAL nuclease cleavage half-domains from a single vector. The third vector of (c) carrying a nucleic acid segment, gene or expression cassette for integration may provide homology arms that match with the target sites of the genomic locus as further specified below. In certain instances, it may be desired that said third vector is a non-expression vector which does not allow for expression of the delivered nucleic acid segment or gene prior to integration into the safe harbor locus. The vector system of the invention may be used to co-transfect mammalian or human cells or cell lines according to standard techniques, resulting in concurrent expression of the first and second TAL nuclease half-domains. The TAL nuclease half-domains will dimerize and create a double strand break at the safe harbor locus, and the homology arm regions provided by the third vector will recombine specifically with homologous sequences juxtaposed to the break, thereby inserting the nucleic acid segment, gene or expression cassette.

The nucleic acid segment or region for which integration into the target site is sought may have any number of components. Examples of such components include at least one promoter (e.g., an RNA polymerase I, II or III promoter), at least one enhancer, at least one selectable marker (e.g., a positive and/or negative selectable marker), and/or one or more regions of sequence homology with cellular nucleic acid. The nucleic acid segment or region may encode a protein product or a functional RNA (e.g., a short hairpin RNA molecule or other short interfering RNA molecule, a microRNA, etc.).

In certain instances, the nucleic acid segment for integration may encode a fluorescent or other detectably labelled fusion protein. Expression of fluorescent or other detectably labelled fusion proteins may serve different purposes including, e.g., the labelling of cellular structures in living cells. Such fluorescent or other detectably labelled fusion proteins can be introduced into target cells by various means. For example, certain fluorescent cellular markers referred to as CELLLIGHT® (Life Technologies, Carlsbad, Calif.) are introduced into target cells via the BacMam technology. These baculo vectors encode a cellular marker protein (known to associate with specific cellular structures) fused to a fluorescent protein (such as GFP, RFP, CFP, etc.). Following baculo-based transduction, the fusion protein is expressed and associates with its target structure, allowing for live-cell imaging of the targeted cellular structure by means of the fused fluorescent moiety. One major drawback of baculovirus technology is the transient, and therefore limited, expression of the transduced fusion protein. In certain instances, however, it may be desired to achieve stable expression of a fluorescent marker protein, e.g., to allow long-term observation of cellular structures and associated developments. Stable expression of the respective fluorescent marker protein can be achieved by using TAL nuclease-mediated site-specific integration. A TAL nuclease according to the invention may be used to specifically integrate a single fluorescent or other detectably labelled fusion protein into a noncoding region or a safe harbor locus of the genome, eliminating undesired effects resulting from random insertion and variable copy number.

Examples of marker proteins known to associate with specific structures of human cells are indicated in TABLE 20 below. Any such marker protein can be combined with any fluorescent protein suitable for live-cell imaging to generate a fluorescent fusion protein for specific cell labeling.

TABLE 20

Labeled Cellular Structure    Marker Protein
Actin                         Human actin
Early endosomes               Rab5a
Late endosomes                Rab7a
Endoplasmic reticulum (ER)    ER signal sequence of calreticulin and KDEL (ER retention signal)
Golgi                         Human Golgi-resident enzyme N-acetylgalactosaminyltransferase 2
Histones                      Histone 2B
Lysosomes                     Lamp1 (lysosomal associated membrane protein 1)
MAP4                          MAP4
Mitochondria                  Leader sequence of E1 alpha pyruvate dehydrogenase
Nucleus                       SV40 nuclear localization sequence
Peroxisomes                   Peroxisomal C-terminal targeting sequence
Plasma membrane               Myristoylation/palmitoylation sequence from Lck tyrosine kinase
Synaptic vesicles             Synaptophysin
Talin                         Human C-terminus of talin
Tubulin                       Human tubulin

Such a fluorescent fusion protein may be encoded on a plasmid vector co-delivered to the target cell with a TAL nuclease pair designed to introduce double-strand breaks at the target locus. Thus, in a first embodiment the invention relates, in part, to a vector carrying an expression cassette to be inserted into the genome of a mammalian or human cell, wherein the expression cassette encodes a fluorescent fusion protein and the vector further provides homology arms that match the target sites of the genomic locus. A vector according to such first embodiment may encode one of the marker proteins listed in TABLE 20 fused to a sequence encoding a fluorescent protein selected from green fluorescent protein (GFP) or enhanced green fluorescent protein (EGFP), red fluorescent protein (RFP), blue fluorescent protein (BFP), cyan fluorescent protein (CFP), yellow fluorescent protein (YFP) or violet-excitable green fluorescent protein (Sapphire). Depending on the folding requirements of the marker protein, the fluorescent protein may be fused to either the amino-terminal or the carboxyl-terminal end of the marker and may be separated from it by a flexible linker such as, e.g., a glycine-serine linker. In a second, alternative embodiment, the vector encodes a fluorescent protein sequence and an engineered insertion site for insertion of a marker sequence of interest (e.g., encoding one of the markers listed in TABLE 20). The marker sequence may be inserted into a vector of such second embodiment by any of the various means described elsewhere herein, including type II or type IIS restriction enzyme cleavage or recombination. Thus, such a vector may, for example, be a Gateway vector allowing for insertion of the marker gene via att-site mediated recombination. A vector according to such first or second embodiment may further be provided as part of a kit or vector system.

Thus, the invention also relates to a kit or vector system allowing for targeted integration of an expression cassette encoding a fluorescent or other detectably labelled fusion protein into the genome of a mammalian or human cell, wherein said kit or vector system comprises at least the following components:

    (a) a first expression vector carrying a first TAL effector fused to a first nuclease cleavage half-domain, wherein the first TAL effector binds a first target site within the genome of a mammalian or human cell,
    (b) a second expression vector carrying a second TAL effector fused to a second nuclease cleavage half-domain, wherein the second TAL effector binds a second target site within the genome of said mammalian or human cell and wherein the second nuclease cleavage half-domain is capable of dimerizing with said first nuclease cleavage half-domain to form a functional dimer, and
    (c) a third vector according to the first or second embodiment described above.

Such kit or vector system may be used to create cell lines or whole organisms stably expressing a fluorescent or other detectably labelled protein fused to any desired marker gene. Vectors encoding fluorescent or other detectably labelled fusion proteins or any other nucleic acid segment subject to site-specific integration will be equipped with homology regions to allow for homologous recombination into the target locus of the cell.

FIG. 21 shows an example of a single-site homologous recombination process. In this process, there is a region of homology between one end of the nucleic acid segment or region for integration and the cellular nucleic acid. Thus, homologous recombination occurs at one end of the nucleic acid segment or region for integration and another joining method (e.g., non-homologous end joining) occurs between the cellular nucleic acid and the other end of the nucleic acid segment or region.

The length of the region of shared sequence homology and the amount of sequence identity between the two regions may vary greatly. Typically, the higher the degree of sequence identity between two nucleic acid molecules, the shorter the regions of shared homology need to be for efficient homologous recombination. Thus, there are at least three parameters for consideration: (1) the degree of sequence identity between the homologous regions of the two nucleic acids, (2) the length of the shared region of sequence homology, and (3) the efficiency of the homologous recombination process.

In many instances, it will be desirable for homologous recombination to occur with high efficiency. However, if a selection marker is included in the nucleic acid segment or region for integration, then high levels of homologous recombination may not be needed. Further, lower levels of homologous recombination may be acceptable when a single construct is integrated into cellular nucleic acid of different cell types (e.g., cells from different species). In such instances, it may be desirable to have a single integration construct, designed to be capable of undergoing homologous recombination in multiple cell types, and to accept lower levels of homologous recombination in one or more of the cell types.

The lengths of the regions of shared homology may vary greatly but typically will be between 10 and 2,000 nucleotides (e.g., from about 10 to about 2,000, from about 50 to about 2,000, from about 100 to about 2,000, from about 200 to about 2,000, from about 400 to about 2,000, from about 500 to about 2,000, from about 10 to about 1,500, from about 10 to about 1,000, from about 10 to about 500, from about 50 to about 1,500, from about 100 to about 1,000, from about 200 to about 1,500, from about 200 to about 1,000, etc.). Also, the percent identity between the shared regions will typically be greater than 80% (e.g., from about 80% to about 99%, from about 80% to about 95%, from about 80% to about 90%, from about 85% to about 99%, from about 90% to about 99%, from about 90% to about 95%, etc.). Typically, there will be an inverse correlation between the level of sequence identity and the length of the shared sequences needed for efficient homologous recombination.
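By way of illustration only, the typical ranges stated above can be expressed as a simple check on a candidate homology region. The following minimal sketch assumes that the homology arm and the corresponding genomic region have already been aligned to equal length without gaps; the function names and the default thresholds (10 to 2,000 nucleotides, at least 80% identity) are hypothetical choices taken from the typical values mentioned in this paragraph and are not requirements of the invention.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)


def check_homology_arm(arm: str, genomic: str,
                       min_len: int = 10, max_len: int = 2000,
                       min_identity: float = 80.0) -> dict:
    """Compare a homology arm with its genomic counterpart (thresholds are assumptions)."""
    identity = percent_identity(arm, genomic)
    return {
        "length": len(arm),
        "identity": identity,
        "length_ok": min_len <= len(arm) <= max_len,
        "identity_ok": identity >= min_identity,
    }


# Hypothetical 10-nucleotide arm differing from the genome at one position.
report = check_homology_arm("ACGTACGTAC", "ACGTACGTAA")
```

Such a check addresses only the first two parameters (identity and length); the efficiency of recombination itself would still have to be determined experimentally.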

The invention also includes multiple-site homologous recombination systems. Single-site homologous recombination systems generally result in the insertion of a nucleic acid segment or region into cellular nucleic acid, and two-site homologous recombination systems generally result in the replacement of cellular nucleic acid with the integrated nucleic acid segment or region.

Selection systems for enrichment of TAL-nuclease modified cells. Nucleases used to create double-stranded DNA breaks for site-specific integration may be active as dimers as described above. Thus, TAL nucleases such as, e.g., TAL-FokI nucleases are designed in pairs, where each nuclease cleavage half domain is fused to a TAL effector with a different binding specificity to allow simultaneous binding of both TAL moieties to opposing DNA target half-sites separated by a spacer. Binding of the TAL-FokI nucleases to their DNA targets allows the FokI monomers to dimerize, resulting in a functional enzyme that will create a DNA double-strand break. However, editing of the genome at specific loci in chromosomal DNA by a modifying agent such as a TAL nuclease can vary in efficiency in response to many factors. Delivery of the engineering agent into the cell (transfection), expression of the agent, and delivery into the nucleus are just the first steps. Engineering agents which are delivered to the nucleus must find and bind the specific loci in the genome, the efficiency of which is determined by the state of the locus (availability due to chromatin formation) and the affinity of the agent for the binding site. TAL nucleases, for instance, can have a cleavage efficiency anywhere between 2% and 50% as a result of the combined effect of all these factors. One bottleneck in TAL nuclease-mediated cell engineering is the lack of systems to enrich or select modified cells. When cleavage efficiency is low, laborious screening of many clones is usually required in order to identify those cells that have been modified by the respective TAL nuclease, which make up only a minor fraction of a pool of cells.
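The paired arrangement of target half-sites described above may be illustrated with a short sketch that searches a DNA sequence for a left half-site followed, after a spacer, by a right half-site bound on the opposite strand. The function names, the example sequences and the spacer range of 12 to 21 nucleotides are hypothetical and are chosen for illustration only; suitable spacer lengths depend on the particular nuclease architecture.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of an A/C/G/T sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_paired_sites(genome: str, left_site: str, right_site: str,
                      min_spacer: int = 12, max_spacer: int = 21) -> list:
    """Return (left_start, spacer_length) for each left/right half-site pair.

    The right TAL effector is assumed to bind the opposite strand, so its
    site is searched for as a reverse complement on the given strand.
    """
    genome = genome.upper()
    left = left_site.upper()
    right_on_top = reverse_complement(right_site)
    hits = []
    start = genome.find(left)
    while start != -1:
        spacer_start = start + len(left)
        for spacer in range(min_spacer, max_spacer + 1):
            r_start = spacer_start + spacer
            if genome[r_start:r_start + len(right_on_top)] == right_on_top:
                hits.append((start, spacer))
        start = genome.find(left, start + 1)
    return hits


# Hypothetical locus: left half-site, 15-nt spacer, right half-site (opposite strand).
locus = "GG" + "TACGTTAC" + "A" * 15 + "GATGGCCA" + "TT"
pairs = find_paired_sites(locus, "TACGTTAC", "TGGCCATC")
```

The sketch only locates exact matches and does not model chromatin state, binding affinity or any of the other factors noted above as influencing cleavage efficiency.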

Cells may be sorted or separated by various means. One popular method is cell sorting via flow cytometry, which allows for physical separation of sub-populations of cells from a heterogeneous population. The advantage of cell sorting based on flow cytometry is that it is able to use multiparametric analysis to identify highly specific populations. Moreover, it is not just phenotypic characteristics (size, granularity etc.) that can be measured; it is also possible to measure the content of nucleic acids within cells, or even to assess functional characteristics such as ion flux or pH or altered cell states such as apoptosis and cell death. Flow cytometry may also be used to isolate or sort cells expressing fluorescent reporter proteins. Apart from the well-known green fluorescent protein derived from Aequorea victoria, many other engineered or improved fluorescent proteins are now available, providing a broad spectrum of colors with distinct excitation and emission maxima. Examples of each of the main color classes include red fluorescent protein (RFP); blue fluorescent protein known as BFP (Heim et al. Wavelength mutations and posttranslational autoxidation of green fluorescent protein. Proc Natl Acad Sci USA. 91(26):12501-4 (1994); Heim and Tsien. Engineering green fluorescent protein for improved brightness, longer wavelengths and fluorescence resonance energy transfer. Curr Biol. 6(2):178-82 (1996)); cyan fluorescent protein known as CFP (Heim and Tsien. Engineering green fluorescent protein for improved brightness, longer wavelengths and fluorescence resonance energy transfer. Curr Biol. 6(2):178-82 (1996); Tsien R Y. The green fluorescent protein. Annu Rev Biochem. 67:509-44. Review. (1998)); yellow fluorescent protein known as YFP (Ormö et al. Crystal structure of the Aequorea victoria green fluorescent protein. Science. 273(5280):1392-5. (1996); Wachter et al. Structural basis of spectral shifts in the yellow-emission variants of green fluorescent protein. Structure. 6(10):1267-77. (1998)); violet-excitable green fluorescent variant known as Sapphire (Tsien R Y. The green fluorescent protein. Annu Rev Biochem. 67:509-44. Review. (1998); Zapata-Hommer and Griesbeck. Efficiently folding and circularly permuted variants of the Sapphire mutant of GFP. BMC Biotechnol. 3:5. Epub (2003)); and cyan-excitable green fluorescent variant known as enhanced green fluorescent protein or EGFP (Yang et al. Optimized codon usage and chromophore mutations provide enhanced sensitivity with the green fluorescent protein. Nucleic Acids Res. 24(22):4592-3. (1996)). Besides sorting of cells expressing a particular fluorescent protein, the selection may also rely on the close co-localization or interaction of two proteins, each fused to a different fluorescent protein, by a technique referred to as FRET (fluorescence resonance energy transfer). FRET requires a distance- and orientation-dependent transfer of excitation energy from a donor fluorophore to an acceptor chromophore. Accordingly, by expressing the donor fluorescent protein as a fusion with one protein-of-interest and the acceptor fluorescent protein as a fusion with a second protein-of-interest, the distance between the two proteins-of-interest can be inferred from the FRET efficiency measured using, e.g., live cell fluorescence microscopy.

Another way to sort cells is to use magnetic beads. It is possible to positively select cells of interest by adding antibodies or other binding molecules (such as, e.g., a receptor) coupled to magnetic beads to specifically select the population of interest, or to negatively select cells by adding labeled antibodies specific for cells other than those of interest. Cells may then be passed through a column placed in a strong magnetic field to either elute or retain a population of interest. One example of magnetic separation, known as magnetic-activated cell sorting (MACS® Technology, Miltenyi Biotec, Bisley, UK), is used to isolate transiently transfected cells expressing the gene of interest together with a cotransfected cell surface marker gene. The MACS® methodology allows the separation of cells expressing said surface marker from those lacking the marker. The cell surface marker could either be introduced into cells by DNA-mediated gene transfer techniques as disclosed elsewhere herein or be a surface protein that is endogenously expressed by the cell or cell type. Cells expressing said surface marker protein are then selected with specific antibodies attached to a magnetic matrix by applying a magnetic field under appropriate experimental conditions. The system can be used for any cell surface marker for which a suitable antibody is available. Typical surface markers of mammalian or human cells for which commercial antibodies are available include, e.g., CD2, CD3, CD4, CCR5, CD8, CD11a/LFA-1, CD11b, CD11c, CD13, CD14, CD15, CD16, CD18, CD19, CD20, CD23, CD25, CD27, CD28, CD31, CD33, CD34, CD38, CD40, CD44, CD45, CD45RA, CD45RO, CD54, CD56, CD62L, CD69, CD79a, CD80, CD83, CD86, CD94, CD95, CD117, CD123, CD127, CD138, CD161, CD195, DC-SIGN, CTLA-4, or various MHC class I or MHC class II markers such as HLA-DR or HLA-F. If no labeled commercial antibody against a particular surface marker is available, cells may also be labeled with a primary unconjugated antibody or serum and then bound by a labelled secondary antibody directed to the Fc part of the primary antibody. Alternatively, the primary antibody may also be biotinylated or fluorochrome-conjugated and bound in a second step by an anti-fluorochrome antibody or streptavidin bound to magnetic particles.

Cell enrichment using surrogate reporters. Cleavage of a specific locus is detected by the creation of a lesion (indel) which leaves a mutation in the genomic sequence and, if placed in an open reading frame, may often cause a frameshift gene knockout. In order to enrich for cells that have a high concentration of active TAL nucleases and thus a high likelihood of carrying such lesions, the frameshifting activity of the error-prone nonhomologous end-joining (NHEJ)-mediated repair mechanism can be used to activate reporter genes in transiently expressed vectors. For this purpose, a TAL nuclease pair can be co-delivered into a cell with a “surrogate” reporter construct carrying an expression cassette, wherein said expression cassette contains in 5′ to 3′ direction at least a first selectable marker gene, a left and a right TALE binding half-site separated by a spacer, and a second selectable marker gene, and wherein the first and second selectable marker genes are expressed under the control of a single promoter. The reading frame encoding the first selectable marker is different from the reading frame encoding the second selectable marker so that in the absence of a functional TAL nuclease only the first selectable marker is expressed. Those cells expressing a functional nuclease dimer will allow for introduction of nuclease-mediated double-strand breaks in the spacer region of the surrogate reporter's target sequence. The break will then be repaired by NHEJ, resulting in approximately one third of cases in a frameshift mutation which places the second selectable marker gene in the same reading frame as the first selectable marker gene and thus allows for the expression of both selectable markers. Cells carrying a modified surrogate reporter can therefore be selected via expression of the second selectable marker. The first and second selectable marker genes may be of the same or different nature.
In a first embodiment, the first and second selectable marker genes may encode different fluorescent proteins as described above. For example, the first selectable marker may be GFP and the second selectable marker may be RFP or vice versa. In such embodiment, cells expressing the second selectable marker may be selected by flow cytometry or by fluorescence microscopy as described above. Alternatively, the first selectable marker gene may encode a fluorescent protein and the second selectable marker gene may encode a resistance marker such as, e.g., a hygromycin resistance marker. In this case, modified cells expressing the resistance marker can be put under selective pressure to grow in the presence of the respective antibiotic. In yet another embodiment, the first selectable marker may be a fluorescent protein and the second selectable marker may be one of the above-described cell surface markers. To allow separation from the fusion protein and transport to the cell surface, the surface marker may be fused to a T2A translational cleavage site or other cleavage sites with similar function. Modified cells expressing a surface marker can then be sorted as described above, e.g., via magnetic beads carrying surface-marker-specific antibodies. Because the surrogate reporter will be mainly modified in those cells exhibiting a high concentration of functional TAL nuclease pairs, this method allows for the efficient enrichment of cells that are likely to carry a nuclease-modified genome. Furthermore, the episomal surrogate reporter system is non-invasive, does not interfere with TAL nuclease activity and will be diluted out after a few cell divisions, which makes it an attractive and efficient tool for cell enrichment.
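The reading-frame logic underlying the surrogate reporter can be sketched as follows. The example assumes, purely for illustration, that the second selectable marker starts out of frame by one or two nucleotides relative to the first marker and that NHEJ repair changes the length of the spacer by a net number of nucleotides (positive for insertions, negative for deletions). The function name and the range of indel sizes used below are hypothetical.

```python
def second_marker_in_frame(initial_offset: int, indel: int) -> bool:
    """True if a net indel restores the reading frame of the second marker.

    initial_offset: nucleotides by which the second marker is out of frame (1 or 2).
    indel: net length change at the repaired break (insertion > 0, deletion < 0).
    """
    if initial_offset % 3 == 0:
        raise ValueError("the second marker must start out of frame")
    return (initial_offset + indel) % 3 == 0


# Tally over a hypothetical, evenly weighted set of indels from -10 to +10 nt.
indels = [n for n in range(-10, 11) if n != 0]
restored = sum(second_marker_in_frame(1, n) for n in indels)
fraction = restored / len(indels)
```

Under this simplified, evenly weighted model the restored fraction is close to one third, in line with the estimate given above; actual indel size distributions are not uniform and will vary with the locus and cell type.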

Tse2/Tsi2 selectable marker system for enrichment of TAL nuclease modified cells. Apart from positive cell enrichment via fluorescence or surface marker expression, both of which require additional separation or isolation steps, cells may also be selected by negative selection, i.e., by removing all cells that do not carry functional TAL nuclease pairs. The inventors have developed two expression systems which rely on the Tse2/Tsi2 selectable marker system, which in turn depends on the interaction between the toxin Tse2 and the antidote Tsi2. Whereas the expression of the cellular toxin Tse2 results in cell death (in many prokaryotic and eukaryotic cells), the co-expression of Tsi2 will restore cell viability (as described in detail elsewhere herein).

In a first embodiment, the invention relates to an expression system comprising at least a first and second vector expressing TAL nuclease cleavage half domains and a third vector functioning as a “surrogate reporter” as defined above. The surrogate reporter vector may comprise in a 5′ to 3′ direction a Tse2 coding sequence, both TAL effector target half-sites (left and right) separated by a spacer, a self-cleavage sequence, and a Tsi2 coding sequence. An example of such surrogate reporter vector is shown in FIG. 34A. Whereas Tse2 is constitutively expressed (e.g., from a weak promoter such as PGK or SV40 etc.), the sequence encoding Tsi2 is placed out of frame so that no Tsi2 can be produced. Those cells expressing a functional nuclease dimer will allow for introduction of nuclease-mediated double-strand breaks into the target sequence of the surrogate reporter. The break will then be repaired by error-prone nonhomologous end-joining (NHEJ), which often causes frameshift mutations. Approximately one third of those mutations will place the Tsi2 coding sequence in frame with Tse2 and thus allow for Tsi2 expression, which will protect cells from Tse2-induced cell stasis. Such system allows only for the proliferation of those transfected cells which have a high likelihood of carrying a nuclease-modified genome. In most cases, co-transfection will deliver all three vectors into a single cell. However, to select against cells not carrying a surrogate reporter, the surrogate reporter vector may be equipped with an additional selection marker such as, e.g., an antibiotic resistance gene, and cells may be grown under selective pressure.

In a second embodiment, the invention relates to an expression system comprising a first and a second expression vector each encoding a TAL nuclease cleavage half domain, wherein the first TAL nuclease cleavage half domain in the first vector is fused to a Tse2 coding sequence via a self-cleavage site and wherein the second TAL nuclease cleavage half domain in the second vector is fused to a Tsi2 coding sequence via a self-cleavage site. One example of such second embodiment is illustrated in FIG. 34B, which shows a vector set for coexpression of two TALE FokI nuclease cleavage half domains, each of which is connected in a separate vector to either Tse2 or Tsi2 via a T2A self-cleavage site. In this example, both fusion proteins are expressed under the control of a CMV promoter. However, any other promoter allowing for substantial expression in a given cell or tissue may be used instead as described elsewhere herein. As the modification of cells depends on the coexpression of both TAL nuclease cleavage half domains and the survival of cells depends on expression of Tsi2, only cells with balanced Tse2 and Tsi2 expression levels (which likely exhibit high nuclease activity) will survive and grow. The T2A self-cleavage site can be replaced by another cleavage site of similar function. Alternatively, other vector settings may be used which allow for simultaneous expression of Tse2 with a first TAL nuclease cleavage half domain and simultaneous expression of Tsi2 with a second TAL nuclease cleavage half domain. For example, the Tse2 and Tsi2 sequences may be connected to the first and second TAL nuclease cleavage half domains via an IRES, a translational coupling sequence or an intein as indicated in FIG. 22. The two described embodiments allow for efficient non-invasive enrichment of cells expressing functional nuclease dimers without further cell separation or isolation, which will facilitate the use of programmable nucleases in biotechnology and basic research.

Once a cell has been identified which potentially has integrated a nucleic acid segment or region into cellular nucleic acid, the integration site location may be confirmed by any number of methods. One method would be to sequence the integration site to determine whether the nucleic acid for which integration was desired is present. Another method is through the use of the polymerase chain reaction (PCR). FIG. 21 shows two primer sites. PCR using primers which bind to the indicated sites will generate PCR products of different lengths depending on whether nucleic acid in addition to the cellular nucleic acid is present. Such PCR reactions will give one of three results: (1) a relatively short PCR product, indicating that integration has not occurred at the locus, (2) a relatively long PCR product, indicating that integration has occurred at the locus, and (3) both relatively short and relatively long PCR products, indicating that the sample contains a mixture of (1) and (2). The third result can occur if there is a mixed population of cells (e.g., one cell type where integration has occurred and another cell type where integration has not occurred) or if the cell is diploid or polyploid and integration has not occurred in all copies of the cellular nucleic acid. One instance of this would be if the target nucleic acid is the Chlamydomonas reinhardtii chloroplast genome. The chloroplast of this organism contains roughly 60 copies of the chloroplast chromosome.
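The three possible PCR outcomes described above may be summarized in a short decision routine. The sketch below assumes that the expected length of the unmodified product and the length of the integrated segment are known, and that observed band sizes are estimated from a gel to within a tolerance; the function name, the tolerance of 50 base pairs and the example sizes are hypothetical.

```python
def classify_pcr(band_sizes, unmodified_len: int, insert_len: int,
                 tolerance: int = 50) -> str:
    """Interpret observed PCR product sizes (bp) at a candidate integration locus."""
    expected_short = unmodified_len               # no integration at the locus
    expected_long = unmodified_len + insert_len   # integration has occurred
    has_short = any(abs(b - expected_short) <= tolerance for b in band_sizes)
    has_long = any(abs(b - expected_long) <= tolerance for b in band_sizes)
    if has_short and has_long:
        return "mixed"           # result (3): both products present
    if has_long:
        return "integrated"      # result (2)
    if has_short:
        return "not integrated"  # result (1)
    return "inconclusive"        # no band near either expected size


# Hypothetical locus: 500 bp unmodified product, 1,500 bp integrated segment.
outcome = classify_pcr([510, 1990], unmodified_len=500, insert_len=1500)
```

A "mixed" result does not by itself distinguish a mixed cell population from a cell in which only some copies of the target nucleic acid have been modified; sequencing of the integration site remains the more direct confirmation.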

Delivery and Transfection

Vectors for TAL delivery. In one aspect, the invention relates to novel vectors for delivery of TAL effectors to host cells. TAL effectors are generally delivered to cells in single expression vectors, wherein the TAL binding domain and the effector domain are provided in a single expression cassette and expressed as a fusion protein from a single promoter. However, in embodiments where it is desirable for TAL effectors to dimerize or multimerize to fulfil their effector function (such as, e.g., in the reconstitution of certain nuclease activities, including FokI or truncated or modified variants thereof), at least a pair of these single expression vectors may be delivered to a given host cell. Co-delivery of two or more expression vectors may result in unequal uptake of the vectors and unequal expression and thus in under- or overrepresentation of one or more of the interacting domains, leading to a loss in enzymatic activity. Co-expression vectors may be used to resolve such issues. Such vectors may be constructed in a manner which allows for the simultaneous expression of two or more TAL effector domains from the same vector (e.g., plasmid). In some embodiments, the co-expression vectors used allow for simultaneous expression of at least one TAL effector pair from the same vector.

Vectors produced by and/or used in the practice of the invention include those suitable for co-expression of at least two different TAL effector proteins and may include, for example, one or more of the following components: (i) a promoter operatively linked with a first open reading frame encoding a first TAL effector protein or a truncated or modified version thereof, (ii) a second open reading frame encoding a second TAL effector protein or a truncated or modified version thereof and (iii) a sequence element operatively linking at least the first and the second TAL effector open reading frame, wherein the second TAL effector open reading frame contains at least one stop codon. These vectors may further comprise at least one second expression cassette encoding a resistance marker. In another aspect of the invention, at least one promoter of the aforementioned vector may be an algal, mammalian, yeast, bacterial, or plant promoter as disclosed elsewhere herein. In another aspect, the aforementioned vector may allow at least for expression in microalgae and the promoter may be a synthetic promoter active in microalgae. In one aspect, the promoter of the aforementioned vector may be a CMV or EF1-α promoter, a tissue-specific mammalian promoter, or derivatives thereof.

In a first embodiment, the first open reading frame of the co-expression vector contains a stop codon, the second open reading frame contains a start codon, and the sequence element operatively linking at least the first and the second TAL effector open reading frame contains an internal ribosome entry site (IRES) (FIG. 22A). Co-expression vectors containing an internal ribosomal entry site (IRES) element from the encephalomyocarditis virus (EMCV) allow for translation of two open reading frames (ORF 1 and ORF 2, respectively) from one message. IRESs are relatively short DNA sequences that can initiate RNA translation in a 5′ cap-independent fashion. Placement of the IRES and a second gene of interest (ORF 2) downstream of the first target gene (ORF 1) allows co-expression of ORF 1 in a cap-dependent manner and ORF 2 in a cap-independent fashion, thus facilitating translation of two proteins from one mRNA transcript. In some instances the expression of the second open reading frame, which is triggered by an IRES or related signal, may be weaker than expression of the first open reading frame. This may be advantageous where more of the first protein is required or desirable. As an example, the second open reading frame could encode a selectable marker. In such an instance, stringent selection may be used to identify vectors and cells with high levels of expression of the first open reading frame.

Co-expression of two genes from the same promoter can also take place with the utilization of the Thosea asigna virus 2A (T2A) translational cleavage site or other cleavage sites with similar function. The T2A cleavage site is ~20 amino acids long and can be positioned in between the two open reading frames. Cleavage occurs via a co-translational ribosome skipping mechanism between the C-terminal glycine and proline residues, leaving 17 residues attached to the end of the first open reading frame and one residue (proline) at the start of the second open reading frame. Thus, in another embodiment, the first open reading frame of the co-expression vector does not contain a stop codon, the second open reading frame does not contain a start codon, and the sequence element operatively linking at least the first and the second open reading frame contains a translational cleavage site, such as, e.g., a T2A site (FIG. 22B). Thus, the invention further includes compositions and methods for the production of polyproteins which are processed to generate two or more polypeptides from an initial translated product.
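The processing of a polyprotein at a 2A site may be illustrated with a short sketch. The example assumes the commonly used 18-residue T2A core peptide EGRGSLLTCGDVEENPGP, which is shorter than the approximately 20 amino acids mentioned above and is used here only as an illustrative assumption; the function name and the flanking example sequences are hypothetical.

```python
# Assumed 18-residue T2A core peptide (illustrative; longer variants exist).
T2A = "EGRGSLLTCGDVEENPGP"


def split_at_t2a(polyprotein: str) -> tuple:
    """Split a translated polyprotein between the C-terminal Gly and Pro of T2A.

    Returns a one-element tuple if no T2A peptide is found.
    """
    i = polyprotein.find(T2A)
    if i == -1:
        return (polyprotein,)
    cut = i + len(T2A) - 1  # position of the final proline
    return polyprotein[:cut], polyprotein[cut:]


# Hypothetical first ORF "MAAA" and second ORF "KKK" joined by T2A.
upstream, downstream = split_at_t2a("MAAA" + T2A + "KKK")
```

With this assumed peptide, the upstream product carries 17 residues of the 2A sequence at its carboxyl terminus and the downstream product begins with proline.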

In a further embodiment, a sequence element operatively linking at least the first and the second open reading frame contains a translational coupler sequence. Translational coupling is achieved either by placing the stop codon of the first open reading frame in direct neighborhood of the start codon of the second open reading frame (e.g., UGAAUG) or by causing an overlap between the stop codon of the first open reading frame and the start codon of the second open reading frame as, for example, represented by the sequence UGAUG. Thus, in some embodiments, the translational coupler sequence may be either UGAAUG or UGAUG (FIG. 22C).
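The two coupler arrangements described above (adjacent stop and start codons, UGAAUG, and overlapping stop and start codons, UGAUG) can be located in a transcript with a simple scan. The following sketch is illustrative only: the function name and example transcripts are hypothetical, DNA input is converted to RNA notation, and the scan does not verify that the stop codon lies in the reading frame of the first open reading frame.

```python
def find_couplers(mrna: str) -> list:
    """Return (position, motif) for each UGAAUG or UGAUG motif in a transcript."""
    mrna = mrna.upper().replace("T", "U")  # accept DNA or RNA notation
    hits = []
    for i in range(len(mrna)):
        if mrna.startswith("UGAAUG", i):
            hits.append((i, "UGAAUG"))  # stop and start codons adjacent
        elif mrna.startswith("UGAUG", i):
            hits.append((i, "UGAUG"))   # stop and start codons overlap by one base
    return hits


# Hypothetical junction sequences between two open reading frames.
adjacent = find_couplers("GCCUGAAUGGCC")
overlapping = find_couplers("GCCTGATGGCC")
```

A fuller implementation would additionally check the reading frame of each open reading frame on either side of the motif.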

In yet another embodiment, the sequence element operatively linking at least the first and the second open reading frame contains an intein that is able to excise itself from the fusion protein (FIG. 22D).

In an additional embodiment, the first and second open reading frames are located in the same vector at different insertion sites. The two open reading frames may be expressed from two separate expression cassettes, each under control of a separate promoter. In one aspect, the two separate promoters may be different promoters, such as, e.g., a constitutive and an inducible promoter or a strong and a weak promoter or different combinations thereof. In certain instances at least one of the open reading frames has been codon-optimized with regard to a target host. The open reading frames of the vectors of the invention may, e.g., encode TAL effector nuclease cleavage domains. For example, a first open reading frame may encode a first TAL-FokI nuclease domain and a second open reading frame may encode a second TAL-FokI nuclease domain. In some embodiments at least one open reading frame may encode a mutated, truncated or modified TAL-FokI nuclease domain. The mutated domain may, e.g., be a Sharkey domain or may carry at least one of the following mutations: E490K, I538K, H537R, Q486E, I499L, N496D, R487D, and/or D483R.

TAL Delivery Systems

An important factor in the administration of polypeptide compounds, such as TAL effectors, is ensuring that the polypeptide has the ability to traverse the plasma membrane of a cell, or the membrane/matrix of an intra-cellular compartment such as the nucleus. Cellular membranes are composed of lipid-protein bilayers that are freely permeable to small, nonionic lipophilic compounds and are inherently impermeable to polar compounds, macromolecules, and therapeutic or diagnostic agents. However, proteins and other compounds such as liposomes have been described which have the ability to translocate polypeptides such as TAL effectors across a cell membrane. For example, “membrane translocation polypeptides” have amphiphilic or hydrophobic amino acid subsequences that have the ability to act as membrane-translocating carriers. In one embodiment, homeodomain proteins have the ability to translocate across cell membranes. Examples of peptide sequences which can be linked to a protein, for facilitating uptake of the protein into cells, include, but are not limited to: an 11 amino acid peptide of the tat protein of HIV; a 20 residue peptide sequence which corresponds to amino acids 84-103 of the p16 protein (see Fahraeus et al., Current Biology 6:84 (1996)); the third helix of the 60-amino acid long homeodomain of Antennapedia (Derossi et al., J. Biol. Chem. 269:10444 (1994)); the h region of a signal peptide such as the Kaposi fibroblast growth factor (K-FGF) h region (Lin et al. “Identification, expression, and immunogenicity of Kaposi's sarcoma-associated herpesvirus-encoded small viral capsid antigen”, J. Virol. 1997 April; 71(4):3069-76.) or the VP22 translocation domain from HSV (Elliott & O'Hare, Cell 88:223-233 (1997)). Other suitable chemical moieties that provide enhanced cellular uptake may also be chemically linked to ZFPs. Membrane translocation domains (i.e., internalization domains) can also be selected from libraries of randomized peptide sequences. See, for example, Yeh et al. (2003) Molecular Therapy 7(5):S461, Abstract #1191.

Many toxin molecules also have the ability to transport polypeptides across cell membranes. Often, such molecules (called “binary toxins”) are composed of at least two parts: a translocation/binding domain or polypeptide and a separate toxin domain or polypeptide. Typically, the translocation domain or polypeptide binds to a cellular receptor, and then the toxin is transported into the cell. Several bacterial toxins, including Clostridium perfringens iota toxin, diphtheria toxin (DT), Pseudomonas exotoxin A (PE), pertussis toxin (PT), Bacillus anthracis toxin, and pertussis adenylate cyclase (CYA), have been used to deliver peptides to the cell cytosol as internal or amino-terminal fusions (Arora et al., J. Biol. Chem., 268:3334-3341 (1993); Perelle et al., Infect. Immun., 61:5147-5156 (1993); Stenmark et al., J. Cell Biol. 113:1025-1032 (1991); Donnelly et al., Proc. Natl. Acad. Sci. USA 90:3530-3534 (1993); Carbonetti et al., Abstr. Annu. Meet. Am. Soc. Microbiol. 95:295 (1995); Sebo et al., Infect. Immun. 63:3851-3857 (1995); Klimpel et al., Proc. Natl. Acad. Sci. USA 89:10277-10281 (1992); and Novak et al., J. Biol. Chem. 267:17186-17193 (1992)). Such peptide sequences can be used to translocate TAL-cleavage domain fusion proteins across a cell membrane. TAL effectors can be conveniently fused to or derivatized with such sequences. Typically, the translocation sequence is provided as part of a fusion protein. Optionally, a linker can be used to link the TAL effector and the translocation sequence. Any suitable linker can be used, e.g., a peptide linker.

TAL effectors can also be introduced into an animal cell, such as a mammalian cell, via liposomes and liposome derivatives such as immunoliposomes. The term “liposome” refers to vesicles comprised of one or more concentrically ordered lipid bilayers, which encapsulate an aqueous phase. The aqueous phase typically contains the compound to be delivered to the cell. Liposomes are believed to fuse with the plasma membrane, thereby releasing the drug into the cytosol. Alternatively, the liposome may be phagocytosed or taken up by the cell in a transport vesicle. Once in the endosome or phagosome, the liposome is believed to either degrade or fuse with the membrane of the transport vesicle and release its contents. When liposomes are endocytosed by a target cell, for example, they become destabilized and release their contents. This destabilization is termed fusogenesis. Dioleoylphosphatidylethanolamine (DOPE) is the basis of many “fusogenic” systems. The invention thus includes compositions and methods for the use of liposomes to deliver TAL effectors to cells.

Conventional viral and non-viral based gene transfer methods can be used to introduce nucleic acids encoding engineered TAL effectors into animal cells (e.g., mammalian cells) and target tissues. Such methods can also be used to administer nucleic acids encoding TAL effectors to cells in vitro. In certain embodiments, nucleic acids encoding TAL effectors may be administered for in vivo or ex vivo gene therapy uses. Non-viral vector delivery systems include DNA plasmids, naked nucleic acid, and nucleic acid complexed with a delivery vehicle such as a liposome or poloxamer. Viral vector delivery systems include DNA and RNA viruses, which have either episomal or integrated genomes after delivery to the cell.

Methods of non-viral delivery of nucleic acids encoding engineered TAL effectors include electroporation, lipofection, microinjection, biolistics, virosomes, liposomes, immunoliposomes, polycation or lipid:nucleic acid conjugates, naked DNA, artificial virions, and agent-enhanced uptake of DNA. Sonoporation using, e.g., the Sonitron 2000 system (Rich-Mar) can also be used for delivery of nucleic acids. Additional exemplary nucleic acid delivery systems include those provided by Amaxa Biosystems (Cologne, Germany), Maxcyte, Inc. (Rockville, Md.) and BTX Molecular Delivery Systems (Holliston, Mass.).

The use of RNA or DNA viral based systems for the delivery of nucleic acids encoding engineered TAL effectors takes advantage of highly evolved processes for targeting a virus to specific cells in the body and trafficking the viral payload to the nucleus. Viral vectors can be administered directly to patients (in vivo) or they can be used to treat cells in vitro, after which the modified cells are administered to patients (ex vivo). Conventional viral based systems for the delivery of TAL effectors include, but are not limited to, retroviral, lentiviral, adenoviral, adeno-associated, vaccinia and herpes simplex virus vectors for gene transfer. Integration in the host genome is possible with the retrovirus, lentivirus, and adeno-associated virus gene transfer methods, often resulting in long term expression of the inserted transgene. Additionally, high transduction efficiencies have been observed in many different cell types and target tissues.

In applications in which transient expression of a TAL effector is desirable, adenoviral based systems can be used. Adenoviral based vectors are capable of very high transduction efficiency in many cell types and do not require cell division. With such vectors, high titer and high levels of expression have been obtained. Such vectors can be produced in large quantities in a relatively simple system. Adeno-associated virus (“AAV”) vectors are also used to transduce cells with target nucleic acids, e.g., in the in vitro production of nucleic acids and peptides, and for in vivo and ex vivo gene therapy procedures (see, e.g., West et al., Virology 160:38-47 (1987); U.S. Pat. No. 4,797,368; WO 93/24641; Kotin, Human Gene Therapy 5:793-801 (1994); Muzyczka, J. Clin. Invest. 94:1351 (1994)). Construction of recombinant AAV vectors is described in a number of publications, including U.S. Pat. No. 5,173,414; Tratschin et al., Mol. Cell. Biol. 5:3251-3260 (1985); Tratschin, et al., Mol. Cell. Biol. 4:2072-2081 (1984); Hermonat & Muzyczka, Proc. Natl. Acad. Sci. USA 81:6466-6470 (1984); and Samulski et al., J. Virol. 63:3822-3828 (1989).

Replication-deficient recombinant adenoviral vectors (Ad) can be produced at high titer and readily infect a number of different cell types. Most adenovirus vectors are engineered such that a transgene replaces the Ad E1a, E1b, and/or E3 genes; subsequently the replication defective vector is propagated in human 293 cells that supply deleted gene function in trans. Ad vectors can transduce multiple types of tissues in vivo, including nondividing, differentiated cells such as those found in liver, kidney and muscle. Conventional Ad vectors have a large carrying capacity. An example of the use of an Ad vector in a clinical trial involved polynucleotide therapy for antitumor immunization with intramuscular injection (Sterman et al., Hum. Gene Ther. 7:1083-9 (1998)).

Packaging cells are used to form virus particles that are capable of infecting a host cell. Such cells include 293 cells, which package adenovirus, and psi-2 cells (a retroviral packaging line created by stably introducing into NIH3T3 cells an engineered retroviral DNA genome from which the RNA packaging signal had been removed) or PA317 cells, which package retrovirus. Viral vectors used in gene therapy are usually generated by a producer cell line that packages a nucleic acid vector into a viral particle. The vectors typically contain the minimal viral sequences required for packaging and subsequent integration into a host (if applicable), other viral sequences being replaced by an expression cassette encoding the protein to be expressed. Missing viral functions are supplied in trans by the packaging cell line. For AAV vectors, the cell line is also infected with adenovirus as a helper. The helper virus promotes replication of the AAV vector and expression of AAV genes from the helper plasmid. In many gene therapy applications, it is desirable that the gene therapy vector be delivered with a high degree of specificity to a particular tissue type. Accordingly, a viral vector can be modified to have specificity for a given cell type by expressing a ligand as a fusion protein with a viral coat protein on the outer surface of the virus. Although the above description applies primarily to viral vectors, the same principles can be applied to nonviral vectors. Such vectors can be engineered to contain specific uptake sequences which favor uptake by specific target cells.

Gene therapy vectors can be delivered in vivo by administration to an individual patient, typically by systemic administration (e.g., intravenous, intraperitoneal, intramuscular, subdermal, or intracranial infusion) or topical application, as described below. Alternatively, vectors can be delivered to cells ex vivo, such as cells explanted from an individual patient (e.g., lymphocytes, bone marrow aspirates, tissue biopsy) or universal donor hematopoietic stem cells, followed by reimplantation of the cells into a patient, usually after selection for cells which have incorporated the vector.

Ex vivo cell transfection for diagnostics, research, or for gene therapy (e.g., via re-infusion of the transfected cells into the host organism) is well known to those of skill in the art. In some embodiments, cells are isolated from the subject organism, transfected with a TAL effector nucleic acid (gene or cDNA), and re-infused back into the subject organism (e.g., patient). Various cell types suitable for ex vivo transfection are well known to those of skill in the art (see, e.g., Freshney et al., Culture of Animal Cells, A Manual of Basic Technique (3rd ed. 1994)).

In one embodiment, stem cells are used in ex vivo procedures for cell transfection and gene therapy. The advantage of using stem cells is that they can be differentiated into other cell types in vitro, or can be introduced into a mammal (such as the donor of the cells) where they will engraft in the bone marrow. Methods for differentiating CD34+ cells in vitro into clinically important immune cell types using cytokines such as GM-CSF, IFN-γ and TNF-α are known (see Inaba et al., J. Exp. Med. 176:1693-1702 (1992)). Stem cells are isolated for transduction and differentiation using known methods. For example, stem cells are isolated from bone marrow cells by panning the bone marrow cells with antibodies which bind unwanted cells, such as CD4+ and CD8+ (T cells), CD45+ (panB cells), GR-1 (granulocytes), and Iad (differentiated antigen presenting cells) (see Inaba et al., J. Exp. Med. 176:1693-1702 (1992)).

Vectors (e.g., retroviruses, adenoviruses, liposomes, etc.) containing therapeutic TAL effector nucleic acids can also be administered directly to an organism for transduction of cells in vivo. Alternatively, naked DNA can be administered. Administration is by any of the routes normally used for introducing a molecule into ultimate contact with blood or tissue cells including, but not limited to, injection, infusion, topical application and electroporation. Suitable methods of administering such nucleic acids are available and well known to those of skill in the art, and, although more than one route can be used to administer a particular composition, a particular route can often provide a more immediate and more effective reaction than another route.

Control of transient TAL effector expression. As described above, viral and non-viral based gene transfer methods can be used to introduce nucleic acids encoding TAL effectors into cells, where they will then be transcribed and translated by the cellular machinery. Following DNA transfection, transient expression of the transgene is generally detectable for 1 to 7 days. Only a fraction of the DNA delivered to the cells makes it to the nucleus for transcription, with eventual export of the message to the cytoplasm for protein production. Within a few days most of the foreign DNA is degraded by nucleases or diluted by cell division, and after a week its presence is no longer detected. However, even such a short expression time may allow a TAL effector to interact with other potential genomic binding sites, leading to unwanted off-target site manipulation. To avoid such additional interaction it may be desired, in certain instances, to fine-tune the transient expression of a TAL effector function or even completely remove a TAL effector from the cell once the intended effect has been achieved. In principle, control of gene expression can be achieved at three different levels: at DNA, mRNA and protein level.

In a first embodiment, the activity of a TAL effector may be controlled at protein level by affecting its protein half-life. Long half-life proteins are accumulated over a very long period (days), such that any increase in production that occurs during a few to several hours has proportionally little impact on the very high steady-state levels already present. To reduce the half-life of a translated TAL effector protein, protein-destabilizing elements may be used. For example, the PEST sequence (a sequence rich in proline, glutamic acid, serine and threonine that acts as a signal peptide for protein degradation) may be fused to a TAL effector sequence (Rechsteiner and Rogers. PEST sequences and regulation by proteolysis. Trends Biochem Sci. 21(7):267-71 (1996)). The PEST sequence is associated with proteins that have a short intracellular half-life and was shown to efficiently destabilize transiently expressed reporter proteins when fused to their C-terminus (Li et al. Generation of destabilized green fluorescent protein as a transcription reporter. J. Biol. Chem. 273:34970-34975 (1998)). Other methods for destabilizing proteins utilize the N-end rule (Bachmair et al. In vivo half-life of a protein is a function of its amino-terminal residue. Science 234:179-186 (1986)) or ubiquitin fusion degradation pathways (Johnson et al. A proteolytic pathway that recognizes ubiquitin as a degradation signal. J. Biol. Chem. 270:17442-17456 (1995)). For example, it has been shown that the degree of destabilization of a protein can be controlled depending on the number of multimerized linear chains of ubiquitin coupled to the target protein (U.S. Pat. No. 7,262,005 incorporated herein by reference in its entirety). Alternatively, recognition sites for cleavage by cellular proteases (such as, e.g., serine, threonine, cysteine, aspartate or glutamic proteases) may be incorporated into the TAL effector sequence. Destabilizing the translated protein, however, only partly addresses the problem, since clearance rates are also dependent on the half-life of the TAL effector mRNA. As long as the pre-existing mRNA remains intact, it continues to produce new TAL effector proteins via translation.
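The dependence of protein clearance on both half-lives can be illustrated with a simple first-order model. The following is a minimal sketch, not a method of the invention: it assumes that transcription has stopped, that the remaining mRNA decays exponentially, that protein is synthesized in proportion to the remaining mRNA, and that protein is degraded with first-order kinetics. The function names, parameters and units are hypothetical.

```python
import math


def decay_rate(half_life_h):
    """First-order rate constant (per hour) for a half-life given in hours."""
    return math.log(2) / half_life_h


def protein_after_shutoff(t_h, p0, m0, k_syn, mrna_half_life_h, protein_half_life_h):
    """Protein level t_h hours after transcription has stopped.

    Assumed model: m(t) = m0 * exp(-km * t) and dp/dt = k_syn * m(t) - kp * p(t),
    where p0 and m0 are the protein and mRNA levels at shutoff.
    """
    km = decay_rate(mrna_half_life_h)
    kp = decay_rate(protein_half_life_h)
    # Protein that already existed at shutoff and has not yet been degraded.
    carry_over = p0 * math.exp(-kp * t_h)
    # Protein translated after shutoff from the decaying mRNA pool.
    if math.isclose(km, kp):
        newly_made = k_syn * m0 * t_h * math.exp(-kp * t_h)
    else:
        newly_made = k_syn * m0 * (math.exp(-km * t_h) - math.exp(-kp * t_h)) / (kp - km)
    return carry_over + newly_made
```

Under these assumptions, shortening only the protein half-life leaves the second term in place, so a long-lived mRNA keeps replenishing the protein pool; this is the reason for combining protein-level and RNA-level destabilization.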

Thus, in a second embodiment destabilizing elements may alternatively (or in addition) be provided at RNA level. For example, a PCR fragment or synthetic oligonucleotide containing an AU-rich sequence stretch known to destabilize cellular RNA may be fused to the 3′-UTR region (Zubiaga et al. The nonamer UUAUUUAUU is the key AU-rich sequence motif that mediates mRNA degradation. Mol. Cell. Biol. 15:2219-2230 (1995)). RNA-destabilizing elements derived from myc or fos genes may also be suitable for this purpose (Yeilding et al. Identification of sequences in c-myc mRNA that regulate its steady-state levels. Mol. Cell. Biol. 16:3511-3522 (1996); Shyu et al. The c-fos transcript is targeted for rapid decay by two distinct mRNA degradation pathways. Genes Dev. 3:60-72 (1989)). Alternatively, an artificial intron may be created within the coding region of a TAL effector sequence, defined by a splice donor and a splice acceptor site that will cause splicing of TAL effector transcripts. The splice donor site includes an almost invariant sequence GU at the 5′ end of the intron, within a larger, less highly conserved region. The splice acceptor site at the 3′ end of the intron terminates the intron with an almost invariant AG sequence. Upstream from the AG there is a region high in pyrimidines (C and U), or a polypyrimidine tract, that may be created based on the degeneracy of the genetic code. Upstream from the polypyrimidine tract is the branch point, which includes an adenine nucleotide. The artificial intron will induce splicing of at least a portion of the TAL effector transcripts, which will then be translated into nonsense proteins.
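The sequence features named above lend themselves to a simple computational check. The following sketch is illustrative only: it locates the UUAUUUAUU nonamer in a candidate 3′-UTR and tests whether a candidate intron begins with GU, ends with AG, carries a pyrimidine-rich stretch directly upstream of the AG and contains at least one adenine further upstream. The function names, the tract length of 12 nucleotides and the 80% pyrimidine cutoff are assumptions chosen for illustration, not values taken from the specification.

```python
ARE_NONAMER = "UUAUUUAUU"  # AU-rich motif reported by Zubiaga et al.


def to_rna(seq):
    """Upper-case the sequence and write it in the RNA alphabet (T -> U)."""
    return seq.upper().replace("T", "U")


def find_are_nonamers(utr):
    """Return 0-based start positions of the nonamer, including overlapping hits."""
    rna = to_rna(utr)
    last_start = len(rna) - len(ARE_NONAMER)
    return [i for i in range(last_start + 1) if rna.startswith(ARE_NONAMER, i)]


def looks_like_intron(intron, tract_len=12, min_pyrimidine_fraction=0.8):
    """Heuristic check of the intron features described in the text."""
    rna = to_rna(intron)
    if len(rna) < tract_len + 4:
        return False
    # Almost invariant GU at the 5' end and AG at the 3' end.
    if not (rna.startswith("GU") and rna.endswith("AG")):
        return False
    # Stretch immediately upstream of the terminal AG (assumed tract length).
    tract = rna[-(tract_len + 2):-2]
    pyrimidine_fraction = sum(base in "CU" for base in tract) / tract_len
    # A branch point needs an adenine somewhere upstream of the tract.
    upstream = rna[2:-(tract_len + 2)]
    return pyrimidine_fraction >= min_pyrimidine_fraction and "A" in upstream
```

A check of this kind only screens for the listed consensus features; whether a designed intron is actually spliced in a given host would still have to be determined experimentally.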

In a third embodiment, TAL effector expression may also be controlled via temporary gene knockdown by treatment with short DNA or RNA molecules with a sequence complementary to the TAL effector mRNA transcript or gene. In a transient knockdown, the binding of a complementary oligonucleotide to the active TAL effector gene or its transcripts causes decreased expression through blocking of transcription (in the case of gene-binding), degradation of the mRNA transcript (e.g., by small interfering RNA (siRNA)) or blocking of mRNA translation. An siRNA sequence targeting the TAL effector sequence may, for example, be delivered to the target host cell in a separate vector or may be provided with the vector containing the TAL effector, e.g., in the context of a separate expression cassette the promoter of which may be inducible.

Yet another possibility to fine-tune TAL effector expression in a given host cell is the introduction of a TAL effector binding site into the TAL effector coding sequence. If substantial amounts of TAL effector protein have been produced in a cell, the TAL effector will bind to its own DNA, thereby inhibiting further transcription of the gene. Such negative feedback regulation has the advantage that expression control depends on the amount of functional protein in the cell. For example, if the TAL effector is a TAL nuclease, binding of the TAL nuclease to its own DNA will result in double strand breaks of the TAL nuclease encoding DNA. If the TAL effector is a repressor, binding of the TAL repressor to a TAL binding site inserted close to or overlapping the promoter region may interfere with RNA polymerase's progress along the strand, thus impeding the expression of the gene. Thus, the invention relates, in part, to a TAL effector coding expression cassette comprising at least one target sequence for the TAL effector protein encoded by the expression cassette that allows binding of the TAL effector protein, thereby interfering with TAL effector expression.

TAL Nucleic Acid Scaffolds. As described elsewhere herein in more detail, TAL effectors can be fused to functional domains such as nucleases, activators, repressors or epigenetic modifiers, thereby linking their inherent nucleic acid binding specificity to another nucleic acid binding or nucleic acid modifying activity. However, in one instance of the invention specific binding of TAL effectors to target nucleic acid can be used to arrange fused effector functions in predefined order on a nucleic acid scaffold designed to carry multiple TAL binding sites. A related approach was described by Conrado et al. (“DNA-guided assembly of biosynthetic pathways promotes improved catalytic efficiency”, Nucleic Acids Res. 2012 Feb. 1; 40(4):1879-1889) for zinc-finger enzyme fusion proteins. By TAL nucleic acid (e.g., DNA) scaffold as used herein is meant a system comprising at least a nucleic acid scaffold with one or more TAL effector binding sites that can be bound by one or more engineered TAL effector fusions. In one embodiment of the invention TAL effectors may be fused to enzymes catalyzing reactions of a metabolic pathway to efficiently accumulate these enzymatic functions on a nucleic acid scaffold in predefined order. Such organized enzyme assembly may be used to increase or accelerate turnover of existing metabolic pathways or may be used to establish new biosynthetic pathways in a given host. In another embodiment, TAL effectors may be fused to signaling molecules to trigger signaling pathways or construct artificial communication or gene regulatory networks in a given host for applications in gene therapy, tissue engineering, biotechnology, etc.

Thus, the invention also relates, in part, to TAL effector fusions organized on nucleic acid scaffolds. In one aspect TAL nucleic acid scaffolds are designed to harbor multiple target binding sites to assemble different TAL effector fusions. TAL effector binding sites may be located in one strand of the nucleic acid scaffold or may also be located in the opposite strand. Specific binding sites for different TAL effector fusions may be separated by spacers of the same or different length. Spacer length between the binding sites determines the proximity of the bound fusion proteins on the nucleic acid scaffold and may critically influence protein interaction. In some embodiments, the spacers between two binding sites on the same nucleic acid strand may comprise between 2 and 5, between 4 and 10, between 6 and 20, or between 15 and 30 nucleotides. In certain instances, nucleic acid scaffolds of the invention may only carry one unique binding site for each TAL effector fusion. In other instances, nucleic acid scaffolds of the invention may carry several copies of binding sites for a specific TAL effector fusion. The number of binding sites for different TAL effector fusions included in a nucleic acid scaffold may be equal or different. For example, the nucleic acid scaffold may contain one binding site for a first TAL effector, two binding sites for a second TAL effector, three or four binding sites for a third TAL effector, etc., or the nucleic acid scaffold may contain two copies of a first binding site, one copy of a second binding site, two or four or more copies of a third binding site, etc. The nucleic acid scaffold may, for example, consist of several units wherein one unit contains different binding sites for different TAL effectors and the nucleic acid scaffold contains many copies of the entire unit.
For example, the nucleic acid scaffold may comprise TAL binding sites for TAL effectors 1, 2 and 3 in one unit and the nucleic acid scaffold may comprise 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 or more copies of this unit. The invention includes all combinations of repeats and/or ratios of binding sites or binding site units in a nucleic acid scaffold depending on the required concentration or activity of the binding TAL effector function. The order of binding sites for different TAL effectors in a nucleic acid scaffold may also vary in different embodiments of the invention. For example, in a first embodiment one or several copies of a first TAL effector binding site in a nucleic acid scaffold may be followed by one or several copies of a second TAL effector binding site. In another embodiment several copies of a first TAL effector binding site may be interrupted by one or more copies of other TAL effector binding sites. The invention therefore includes all orders of single or multiple TAL effector binding sites in a nucleic acid scaffold depending on the required order of reactions or interactions mediated by the same or different TAL effector functions. The TAL effector binding sites for different TAL effector fusions may have equal or different lengths. For example, the binding site for a first TAL effector may consist of 19 nucleotides whereas the binding site for another TAL effector may consist of 25 nucleotides, etc.
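The arrangement of binding sites, spacers and repeated units described above can be sketched in a few lines of code. The example below is a hypothetical illustration rather than a design tool of the invention: it assumes a single fixed spacer sequence between all neighboring sites and units, and the function names and the short placeholder binding sites used in the usage note are invented for this sketch.

```python
def _check_dna(seq):
    """Return the upper-cased sequence, rejecting anything but A, C, G and T."""
    seq = seq.upper()
    if not seq or set(seq) - set("ACGT"):
        raise ValueError(f"not a plain DNA sequence: {seq!r}")
    return seq


def build_unit(sites, spacer):
    """Build one scaffold unit.

    sites is an ordered list of (binding_site, copies) pairs; every pair of
    neighboring binding sites is separated by the same spacer (an assumption).
    """
    spacer = _check_dna(spacer)
    parts = []
    for site, copies in sites:
        if copies < 1:
            raise ValueError("each binding site needs at least one copy")
        parts.extend([_check_dna(site)] * copies)
    return spacer.join(parts)


def build_scaffold(sites, spacer, unit_copies=1):
    """Repeat the whole unit unit_copies times, again joined by the spacer."""
    if unit_copies < 1:
        raise ValueError("unit_copies must be at least 1")
    unit = build_unit(sites, spacer)
    return _check_dna(spacer).join([unit] * unit_copies)
```

For instance, `build_scaffold([("TACGT", 1), ("TGGCC", 2)], "AAAA", unit_copies=2)` yields a scaffold with a 1:2 ratio of the two placeholder sites in each of two units. Real TAL effector binding sites would be considerably longer, such as the 19 or 25 nucleotides mentioned above, and different spacer lengths could be supported by passing a spacer per junction instead of a single string.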

The invention further relates to methods of assembling nucleic acid scaffolds with multiple TAL binding sites. In some instances, nucleic acid scaffolds may be generated based on plasmid nucleic acid or vectors. For example, a series of TAL binding sites may be inserted into the multiple cloning site of a plasmid or vector. Individual TAL binding sites may be flanked by restriction enzyme cleavage sites to allow for insertion, deletion or replacement of binding sites. In other instances, nucleic acid scaffolds may be synthesized de novo from overlapping oligonucleotides, predesigned parts and/or PCR-based techniques as described elsewhere herein.

In one aspect, TAL nucleic acid scaffolds of the invention are designed to assemble TAL effector enzyme fusions. The coding region of an enzyme or enzymatic domain may be fused either 5′ or 3′ to the TAL effector sequence depending on the structural requirements and/or accessibility of the enzymatic domain. Furthermore, a linker sequence such as, e.g., a sequence encoding a Gly-Ser linker may be included to separate the TAL effector from the enzymatic domain. In certain instances, the TAL effector fusions may be provided on a support as illustrated in FIG. 23A. The TAL effector enzyme fusions may harbor activities catalyzing defined steps of a metabolic pathway. By fusing pathway-associated enzymatic functions to TAL effectors which are organized to bind to a nucleic acid scaffold in a predicted order, the enzymatic functions can be concentrated in a defined region or compartment of a host, thereby increasing pathway flux and turnover rates of metabolic products. TAL nucleic acid scaffolds of the invention may be used, e.g., to establish non-native pathways in a given host. Any host or host cell including bacteria, yeast, fungi, plant, insect or mammalian cells suitable for genetic engineering may be used for this purpose. Thus, the invention relates in part to an engineered cell containing (i) at least one nucleic acid molecule with two or more distinct TAL effector binding sites, (ii) one or more nucleic acid sequences encoding at least a first and a second TAL effector fusion with enzymatic or signaling activity wherein the one or more TAL effector fusions are expressed in the cell and bind to the predicted target binding sites on the at least one nucleic acid molecule. The two or more TAL effector binding sites on the at least one nucleic acid molecule may be present at multiple copies and/or at different stoichiometric ratios.

Furthermore, the TAL nucleic acid scaffold may be episomal, stably integrated into the genome of said engineered cell or attached to the genome of said engineered cell, e.g., using scaffold matrix attachment regions. The engineered cell may be any cell including an algal or microalgal cell. In one aspect, the engineered cell is a microalga and the TAL nucleic acid scaffold may be integrated into or attached to the nuclear or the chloroplast genome of said cell.

In one aspect, the nucleic acid sequences encoding the TAL effector, the linker and/or the fused enzymatic or signaling domain may be codon optimized with regard to the host cell. Different optimization strategies or parameters that may be taken into account are described in more detail elsewhere herein.

In one instance, TAL effector nucleic acid scaffolds may be used to engineer artificial pathways in plants or algae. Algae suitable for use in the present invention encompass both prokaryotic and eukaryotic algae, and in particular unicellular algae also known as microalgae. Non-limiting examples of microalgae that may be used to establish TAL effector nucleic acid scaffolds include Chlamydomonas reinhardtii, Leptolyngbya, Synechococcus elongatus, diatoms, Phaeodactylum tricornutum, Thalassiosira pseudonana, Cyanidioschyzon merolae, Ostreococcus lucimarinus, Ostreococcus tauri, Micromonas pusilla, Fragilariopsis cylindrus, Pseudo-nitzschia, Thalassiosira rotula, Botryococcus braunii, Chlorella vulgaris, Dunaliella salina, Galdieria sulphuraria, Porphyra purpurea, Volvox carteri or Aureococcus anophagefferens. Microalgae systems provide rapid growth rates and inexpensive growth conditions and have the ability to produce lipids and store significant amounts of energy-rich compounds such as triacylglycerides or starch, making them an attractive source for production of biofuels such as biodiesel, green diesel, green gasoline, or green jet fuel. Thus, in one aspect the invention relates to TAL nucleic acid scaffolds for assembly of enzymatic activities involved in biofuel production. The enzymatic activities may be derived from different sources and may be, e.g., of bacterial, plant or yeast origin.

In one example illustrated by FIG. 23, TAL nucleic acid scaffolds are designed for assembly of a pathway producing 2,3-butanediol in microalgae. 2,3-butanediol may be synthesized using enzymes of heterologous source (e.g., bacterial enzymes) in algae. In a first pathway, pyruvate is turned into acetolactate by E. coli acetolactate synthase, an enzyme consisting of a large subunit (encoded by gene ilvI) and an isozyme III small subunit (encoded by gene ilvH). In a next step acetolactate is turned into acetoin by B. subtilis acetolactate decarboxylase encoded by gene alsD. Finally, acetoin is turned into 2,3-butanediol by B. subtilis acetoin reductase/2,3-butanediol dehydrogenase encoded by gene ydjL.

In an alternative pathway pyruvate is first turned into acetyl-CoA by E. coli pyruvate dehydrogenase encoded by gene pdh. Then acetyl-CoA is turned into acetoin and acetaldehyde by the concerted action of the E1α, E1β, E2, and E3 subunits of the acetoin dehydrogenase complex encoded by the acoABCL operon. The final step turning acetoin into 2,3-butanediol is again catalyzed by the B. subtilis ydjL gene product (see FIG. 23B).

A possible arrangement of the TAL effector fusions on TAL nucleic acid scaffolds for assembly of the two described 2,3-butanediol pathways in microalgae is shown in FIG. 23C. The bacterial genes are fused to TAL effectors with different nucleic acid binding specificities. The bacterial genes may require codon optimization to achieve optimal expression rates in microalgae as described elsewhere herein and may be separated by a flexible linker from the TAL effector. The TAL effector fusion sequences are then cloned into one or more suitable vectors (e.g., a functional vector as described elsewhere herein) allowing for efficient expression of the fusion proteins in the target host. Depending on the length of the coding sequences and the number of TAL effector fusions required to establish a given pathway, the TAL effector fusion sequences may be cloned into one single vector or may be provided in different vectors. The TAL effector fusion encoding vectors are then delivered to the host by methods known in the art. They may be delivered together with the TAL nucleic acid scaffolds carrying the TAL effector binding sites or may be delivered into host cells which have already been engineered to contain the TAL nucleic acid scaffolds (e.g., by stable genomic integration). In some instances constitutive expression of the TAL effector fusions may be required whereas in other instances the expression of some or all TAL effector fusions may be inducible. Examples of inducible promoters that can be used in algae include copper-responsive elements or nitrate-responsive promoters. Furthermore, enzymatic activities from different pathways may be assembled or combined in the same host cell to achieve optimal titers of a required intermediate or end product.

To engineer pathways for biofuel production in algae, TAL effectors or TAL effector fusions with repressor or cleavage activity may further be used to induce gene knock-down or block metabolic pathways that lead to the accumulation of energy-rich storage compounds such as starch, or to decrease lipid catabolism to increase lipid accumulation in cells. In some instances it may be required to avoid one specific reaction catalyzed by a given enzyme whereas another reaction catalyzed by the same enzyme may be essential for cell survival. In such a case, the enzyme may be knocked down using a TAL effector fusion and may be replaced by an engineered enzyme or combination of enzymatic activities catalyzing only the desired reaction. The engineered enzymes may be provided via a TAL nucleic acid scaffold as described above. Thus, TAL effectors and TAL nucleic acid scaffolds of the invention can be used to specifically engineer hosts for improved or modified metabolic pathways.

TALs for Tethering Polymerase to DNA Templates

Single-stranded template DNA is sometimes prepared for DNA sequencing by attachment to beads by using emulsion PCR (e.g., 454, SOLID™, and ION TORRENT™ Semiconductor Sequencing). This method creates a population of clonal single stranded DNA covalently attached to a bead. In some sequencing methods (e.g., 454 and ION TORRENT™ semiconductor sequencing) a primer is annealed to the single stranded DNA templates to form a suitable substrate for a DNA dependent DNA polymerase to perform nucleotide addition. It is important that every primer DNA on the bead is bound to a functional polymerase and that all of the DNA substrates on the bead are extended synchronously. If a DNA polymerase dissociates from the DNA (and the bead) the signal will be reduced because that DNA is no longer being extended. Furthermore, if a new polymerase molecule then associates with the free 3′ substrate after the rest of the population on the bead is extended by at least one base, its extension will be asynchronous with the rest of the DNA population on the bead. Its extension products will then contribute noise to the signal. It is therefore normally important that DNA polymerases in these sequencing applications bind strongly to their target DNAs and remain bound through hundreds of nucleotide incorporation cycles. Thus, there is a need for improving the efficiency at which the polymerase functions while extending the template on a particular bead.
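The cumulative effect of polymerase dissociation over many incorporation cycles can be made concrete with a small calculation. The sketch below is a deliberately simplified illustration and not part of the described method: it assumes that each template independently loses its polymerase with a fixed probability per cycle and that a template which has lost its polymerase once no longer contributes in-phase signal. The function names and the per-cycle probability are hypothetical.

```python
def in_phase_fraction(p_dissociate, cycles):
    """Fraction of templates on a bead still extended in phase after `cycles`.

    Assumes an independent, constant per-cycle dissociation probability.
    """
    if not 0.0 <= p_dissociate <= 1.0:
        raise ValueError("p_dissociate must be between 0 and 1")
    if cycles < 0:
        raise ValueError("cycles must not be negative")
    return (1.0 - p_dissociate) ** cycles


def cycles_until_below(p_dissociate, threshold):
    """Smallest cycle count at which the in-phase fraction drops below threshold."""
    if not 0.0 < p_dissociate <= 1.0:
        raise ValueError("p_dissociate must be greater than 0 and at most 1")
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be greater than 0 and at most 1")
    cycles = 0
    fraction = 1.0
    # Multiply step by step so the answer is exact for the model, not rounded.
    while fraction >= threshold:
        fraction *= 1.0 - p_dissociate
        cycles += 1
    return cycles
```

Under these assumptions, a loss of only 1% of polymerases per cycle leaves roughly 37% of the templates in phase after 100 cycles, which illustrates why the polymerase should remain bound through hundreds of cycles.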

Tethering the polymerase to the short template would restrict the polymerase from diffusing away, leading to more favorable kinetics for initiation of primer extension. To address this need, the inventors have designed a TAL effector fused to a polymerase which efficiently binds to a double stranded target binding site at the end of a template, thereby tethering the polymerase to its template DNA. The TAL polymerase fusion protein was produced by a two-step assembly process as described in detail elsewhere herein. The TAL effector domain was designed to bind to the double stranded DNA formed by annealing a primer to the single stranded DNA templates on a bead. For family A polymerases it is generally desirable (but not essential) that the TAL is fused to the amino terminus of the polymerase using a short linker sequence; for family B polymerases the carboxyl terminus may be desirable. Correct orientation and flexibility between the TAL effector and polymerase domains are important to allow independent folding of both domains and to ensure efficient substrate binding and polymerase function at the same time. TAL polymerase fusion proteins can be expressed and purified by conventional methodologies well known to those skilled in the art. The purified TAL polymerase fusion proteins will bind to DNA templates on a bead with higher avidity than the polymerase alone. In the case of an amino-terminal fusion with a family A polymerase, the TAL effector binds to the double stranded DNA formed by the primer annealing to the template and the polymerase binds to the free 3′ end of the primer. The polymerase is freely capable of performing multiple nucleotide additions. During sequencing, the newly forming double stranded DNA forms a loop, but the substrate remains bound at two locations: the TAL effector moiety remains bound to the primer domain while the polymerase remains on the extending 3′ end. If the polymerase dissociates from the substrate, the TAL prevents the polymerase from diffusing away. Because it is bound and localized, the polymerase has a greatly increased opportunity to rebind to the appropriate 3′ end of its substrate and continue synchronous nucleotide synthesis.
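
The effect of tethering on synchronous extension can be illustrated with a toy simulation. The following is only a minimal sketch under stated assumptions, not a model taken from this disclosure: the function name in_phase_fraction, the per-cycle dissociation probability p_off and the same-cycle rebinding probability p_rebind are hypothetical, and the probabilities used are arbitrary illustrative values rather than measured ones. A template is counted as out of phase as soon as it misses a single incorporation cycle.

```python
import random


def in_phase_fraction(cycles, n_templates, p_off, p_rebind, seed=0):
    """Return the fraction of templates on a bead that never miss a cycle.

    p_off    -- assumed per-cycle chance that the polymerase leaves the 3' end
    p_rebind -- assumed chance that a polymerase re-engages the 3' end in time
                to extend within the same cycle (expected to be higher when
                the polymerase is held nearby by a TAL tether)
    """
    rng = random.Random(seed)
    in_phase = 0
    for _ in range(n_templates):
        lag = 0  # number of incorporation cycles this template has missed
        for _ in range(cycles):
            # A cycle is missed only if the polymerase dissociates AND
            # no polymerase rebinds before the cycle ends.
            if rng.random() < p_off and rng.random() >= p_rebind:
                lag += 1
        if lag == 0:
            in_phase += 1
    return in_phase / n_templates


if __name__ == "__main__":
    # Illustrative, assumed parameters: same dissociation rate, different rebinding.
    untethered = in_phase_fraction(200, 2000, p_off=0.01, p_rebind=0.1)
    tethered = in_phase_fraction(200, 2000, p_off=0.01, p_rebind=0.9)
    print(f"untethered in-phase fraction: {untethered:.2f}")
    print(f"tethered in-phase fraction:   {tethered:.2f}")
```

In this sketch the tether does not change how often the polymerase dissociates; it changes only how likely the polymerase is to rebind before the population moves on, which is the mechanism described above.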

Thus, the invention relates, in part, to a TAL polymerase fusion protein. In a first embodiment, the TAL effector binding domain is fused to the amino-terminal end of the polymerase domain. In a second embodiment the TAL binding domain is fused to the carboxyl-terminal end of the polymerase. In certain instances, the TAL and polymerase domains may be separated by a flexible peptide linker sequence such as, e.g., a glycine-serine linker. For some applications, the TAL polymerase fusion protein may be equipped with a tag for purification or detection purposes as described in more detail elsewhere herein. The TAL effector moiety may contain at least six (e.g., at least 8, at least 10, at least 12, at least 15, at least 17, from about 6 to about 25, from about 6 to about 35, from about 8 to about 25, from about 10 to about 25, from about 12 to about 25, from about 8 to about 22, from about 10 to about 22, from about 12 to about 22, from about 6 to about 20, from about 8 to about 20, from about 10 to about 22, from about 12 to about 20, from about 6 to about 18, from about 10 to about 18, from about 12 to about 18, etc.) TAL repeats. In some instances, the TAL effector moiety may contain 18 or 24 or 17.5 or 23.5 TAL nucleic acid binding cassettes. In additional instances, a TAL effector fused to a polymerase may contain 15.5, 16.5, 18.5, 19.5, 20.5, 21.5, 22.5 or 24.5 TAL nucleic acid binding cassettes.
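
The relationship between a primer-derived binding site and the number of TAL repeats can be sketched as follows. This is an illustrative sketch only: the function name design_rvds and the example target are hypothetical, the RVD-to-base table is the commonly cited code (NI for A, HD for C, NN for G, NG for T) rather than one recited in this section, and the counting convention (the last base read by the carboxyl-terminal half-repeat, so that n bases correspond to n minus 0.5 repeats) is an assumption.

```python
# Commonly cited RVD-to-base assignments; an assumption, not quoted from this section.
RVD_FOR_BASE = {"A": "NI", "C": "HD", "G": "NN", "T": "NG"}


def design_rvds(target):
    """Return (list of RVDs, repeat count) for a 5'->3' target sequence.

    One RVD is assigned per base. The final base is assumed to be read by
    the carboxyl-terminal half-repeat, so an n-base target is reported as
    (n - 1) full repeats plus one half-repeat, i.e. n - 0.5 repeats.
    """
    target = target.upper()
    if not target:
        raise ValueError("target sequence is empty")
    unknown = set(target) - set(RVD_FOR_BASE)
    if unknown:
        raise ValueError(f"unsupported bases: {sorted(unknown)}")
    rvds = [RVD_FOR_BASE[base] for base in target]
    return rvds, len(rvds) - 0.5


if __name__ == "__main__":
    # Hypothetical 18-base site within a primer/template duplex.
    rvds, repeats = design_rvds("ACGTACGTACGTACGTAC")
    print(repeats, "-".join(rvds))
```

Under this convention an 18-base site corresponds to a 17.5-repeat array, and a 24-base site to a 23.5-repeat array.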

The polymerase fused to the TAL effector may be any DNA polymerase known in the art that is capable of synthesizing DNA from a single-stranded template. For example, the polymerase may be Thermus aquaticus DNA polymerase (Taq), Thermus filiformis (Tfi) DNA polymerase, Thermococcus zilligii (Tzi) DNA polymerase, Thermus thermophilus (Tth) DNA polymerase, Thermus flavus (Tfl) DNA polymerase, Pyrococcus woesei (Pwo) DNA polymerase, Pyrococcus furiosus (Pfu) DNA polymerase, Turbo Pfu DNA polymerase, Thermococcus litoralis (Tli) DNA polymerase, Vent DNA polymerase, Pyrococcus sp. GB-D polymerase, Thermotoga maritima (Tma) DNA polymerase, Bacillus stearothermophilus (Bst) DNA polymerase, Pyrococcus kodakaraensis (KOD) DNA polymerase, Pfx DNA polymerase, Thermococcus sp. JDF-3 (JDF-3) DNA polymerase, Thermococcus gorgonarius (Tgo) DNA polymerase, Thermococcus acidophilium DNA polymerase, Sulfolobus acidocaldarius DNA polymerase, Thermococcus sp. 9 deg. N-7 DNA polymerase, Thermococcus sp. NA1, Pyrodictium occultum DNA polymerase, Methanococcus voltae DNA polymerase, Methanococcus thermoautotrophicum DNA polymerase, Methanococcus jannaschii DNA polymerase, Desulfurococcus strain TOK DNA polymerase (D. Tok Pol), Pyrococcus abyssi DNA polymerase, Pyrococcus horikoshii DNA polymerase, Pyrococcus islandicum DNA polymerase, Thermococcus fumicolans DNA polymerase, Aeropyrum pernix DNA polymerase, or heterodimeric DNA polymerase DP1/DP2. In some embodiments, the DNA polymerase may be a polymerase such as Deep Vent DNA polymerase (New England Biolabs), AMPLITAQ GOLD® DNA polymerase (Applied Biosciences), Stoffel fragment of AMPLITAQ® DNA Polymerase (Roche), KOD polymerase (EMD Biosciences), Klentaq1 polymerase (DNA Polymerase Technology, Inc), OMNI KLENTAQ™ DNA polymerase (DNA Polymerase Technology, Inc), OMNI KLENTAQ™ LA DNA polymerase (DNA Polymerase Technology, Inc), PHUSION® High Fidelity DNA polymerase (New England Biolabs), HEMO KLENTAQ™ (New England Biolabs), PLATINUM® Taq DNA Polymerase High Fidelity (Life Technologies), PLATINUM® Pfx (Life Technologies), ACCUPRIME™ Pfx (Life Technologies), or ACCUPRIME™ Taq DNA Polymerase High Fidelity (Life Technologies).

The polymerase fused to a TAL effector according to the invention can be any Family A DNA polymerase (also known as pol I family) or any Family B DNA polymerase. In some embodiments, the DNA polymerase can be a recombinant form capable of extending target-specific primers with superior accuracy and yield as compared to a non-recombinant DNA polymerase. For example, the polymerase can include one of the above-listed high-fidelity polymerases or thermostable polymerases.

One example of a TAL polymerase fusion protein according to the invention is shown in FIG. 35. This example shows a TAL binding domain fused to the amino-terminal end of a Bst DNA polymerase via linker sequence GGGVTM (SEQ ID NO: 89). The amino acid sequence of this TAL-Bst1.0 polymerase fusion protein is referred to as SEQ ID NO: 82. Thus, in one specific embodiment the invention relates to a TAL polymerase fusion protein comprising the sequence of SEQ ID NO: 82 and its coding nucleic acid sequence which may be either a wild-type or codon-optimized sequence.
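
The two fusion orientations described above can be sketched as a simple concatenation of domain sequences. This is only an illustrative sketch: the function name build_fusion and the short placeholder domain strings are hypothetical and do not reproduce SEQ ID NO: 82; only the linker GGGVTM is taken from the text.

```python
LINKER = "GGGVTM"  # linker sequence named in the text (SEQ ID NO: 89)


def build_fusion(tal_domain, polymerase, linker=LINKER, tal_at_n_terminus=True):
    """Concatenate TAL domain, linker and polymerase in the chosen orientation.

    tal_at_n_terminus=True  -> TAL-linker-polymerase (amino-terminal TAL fusion)
    tal_at_n_terminus=False -> polymerase-linker-TAL (carboxyl-terminal TAL fusion)
    """
    if not tal_domain or not polymerase:
        raise ValueError("both domain sequences are required")
    if tal_at_n_terminus:
        parts = [tal_domain, linker, polymerase]
    else:
        parts = [polymerase, linker, tal_domain]
    return "".join(parts)


if __name__ == "__main__":
    # Placeholder strings standing in for real domain sequences (hypothetical).
    tal = "MTALTALTAL"
    pol = "POLPOLPOL"
    print(build_fusion(tal, pol))                           # TAL at the amino terminus
    print(build_fusion(tal, pol, tal_at_n_terminus=False))  # TAL at the carboxyl terminus
```

Keeping the linker as a parameter reflects the statement above that other flexible linkers, such as a glycine-serine linker, may be used between the two domains.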

In another aspect, the invention relates to a method for tethering a DNA polymerase to a template DNA by using a TAL polymerase fusion protein as described above. In particular, the invention relates to a method for tethering a DNA polymerase to a primer for extension of single-stranded DNA templates in preparation of sequencing reactions (such as “emulsion PCR”), wherein the DNA polymerase is fused to a TAL effector domain as described above.

Furthermore, the invention also relates, in part, to the use of a TAL polymerase fusion protein in PCR amplification reactions, wherein the DNA template which is to be amplified by the polymerase portion of the TAL polymerase fusion protein is coupled to a bead.

The invention is further described by the following vector sequences:

pENTR221 Truncated TAL MCS Vector Sequence SEQ ID NO: 30    1ctttcctgcg ttatcccctg attctgtgga taaccgtatt accgcctttg agtgagctga   61taccgctcgc cgcagccgaa cgaccgagcg cagcgagtca gtgagcgagg aagcggaaga  121gcgcccaata cgcaaaccgc ctctccccgc gcgttggccg attcattaat gcagctggca  181cgacaggttt cccgactgga aagcgggcag tgagcgcaac gcaattaata cgcgtaccgc  241tagccaggaa gagtttgtag aaacgcaaaa aggccatccg tcaggatggc cttctgctta  301gtttgatgcc tggcagttta tggcgggcgt cctgcccgcc accctccggg ccgttgcttc  361acaacgttca aatccgctcc cggcggattt gtcctactca ggagagcgtt caccgacaaa  421caacagataa aacgaaaggc ccagtcttcc gactgagcct ttcgttttat ttgatgcctg  481gcagttccct actctcgcgt taacgctagc atggatgttt tcccagtcac gacgttgtaa  541aacgacggcc agtcttaagc tcgggcccca aataatgatt ttattttgac tgatagtgac  601ctgttcgttg caacaaattg atgagcaatg cttttttata atgccaactt tgtacaaaaa  661agcaggctgc ggccgcgcca ccatgggaaa acctattcct aatcctctgc tgggcctgga  721ttctaccgga ggcatggccc ctaagaaaaa gcggaaggtg gacggcggag tggacctgag  781aacactggga tattctcagc agcagcagga gaagatcaag cccaaggtga gatccacagt  841ggcccagcac cacgaagccc tggtgggaca cggatttaca cacgcccaca ttgtggccct  901gtctcagcac cctgccgccc tgggaacagt ggccgtgaaa tatcaggata tgattgccgc  961cctgcctgag gccacacacg aagccattgt gggagtggga aaacagtggt ctggagccag 1021agccctggaa gccctgctga cagtggccgg agaactgaga ggacctcctc tgcagctgga 1081tacaggacag ctgctgaaga ttgccaaaag gggcggagtg accgcggtgg aagccgtgca 1141cgcctggaga aatgccctga caggagcccc tctgaaccct tgcaggtgcc ggaattgcca 1201gctggggcgc cctctggtaa ggttgggaag ccctgcaaag taaactggat ggctttcttg 1261ccgccaagga tctgatggcg caggggatca agctctgatc aagagacagg atgaggatcg 1321tttcgcatgc agttcaaagt gtatacctac aaacgtgaaa gccgttatcg tctgtttgtg 1381gatgtgcaga gcgatattat tgataccccg ggtcgtcgta tggtgattcc gctggcctct 1441gcgcgtctgc tgtctgataa agtgagccgt gagctgtatc cggtggtgca tattggtgat 1501gaaagctggc gtatgatgac caccgatatg gcgagcgtgc cggtgagcgt gattggcgaa 1561gaagtggcgg atctgagcca tcgtgaaaac gatatcaaaa acgcgattaa cctgatgttt 1621tggggcattt aataaatgtc aggctccctt atacacagcc 
agtctgcagt cacctgcgga 1681tagcattgtg gcccagctgt ctagacctga tcctgccctg gccgccctga caaatgatca 1741cctggtggcc ctggcctgtc tgggaggcag acctgccctg gatgccgtga aaaaaggact 1801gcctcacgcc cctgccctga tcaagagaac aaatagaaga atccccgagc ggacctctca 1861cagagtggcc gatcacgccc aggtggtgag agtgctggga ttttttcagt gtcactctca 1921ccctgcccag gcctttgatg atgccatgac acagtttggc atgagcagac acggactgct 1981gcagctgttt agaagagtgg gagtgacaga actggaggcc agatccggaa ccctgcctcc 2041tgcctctcag agatgggata ggattctgca gggttcccgt ttaaacaagc ttgtcgacgg 2101taccgaattc atcgatagta ctctcgaggg atccgagctc aagatcttag ctaagtagac 2161ccagctttct tgtacaaagt tggcattata agaaagcatt gcttatcaat ttgttgcaac 2221gaacaggtca ctatcagtca aaataaaatc attatttgcc atccagctga tatcccctat 2281agtgagtcgt attacatggt catagctgtt tcctggcagc tctggcccgt gtctcaaaat 2341ctctgatgtt acattgcaca agataaaaat atatcatcat gaacaataaa actgtctgct 2401tacataaaca gtaatacaag gggtgttatg agccatattc aacgggaaac gtcgaggccg 2461cgattaaatt ccaacatgga tgctgattta tatgggtata aatgggctcg cgataatgtc 2521gggcaatcag gtgcgacaat ctatcgcttg tatgggaagc ccgatgcgcc agagttgttt 2581ctgaaacatg gcaaaggtag cgttgccaat gatgttacag atgagatggt cagactaaac 2641tggctgacgg aatttatgcc tcttccgacc atcaagcatt ttatccgtac tcctgatgat 2701gcatggttac tcaccactgc gatccccgga aaaacagcat tccaggtatt agaagaatat 2761cctgattcag gtgaaaatat tgttgatgcg ctggcagtgt tcctgcgccg gttgcattcg 2821attcctgttt gtaattgtcc ttttaacagc gatcgcgtat ttcgtctcgc tcaggcgcaa 2881tcacgaatga ataacggttt ggttgatgcg agtgattttg atgacgagcg taatggctgg 2941cctgttgaac aagtctggaa agaaatgcat aaacttttgc cattctcacc ggattcagtc 3001gtcactcatg gtgatttctc acttgataac cttatttttg acgaggggaa attaataggt 3061tgtattgatg ttggacgagt cggaatcgca gaccgatacc aggatcttgc catcctatgg 3121aactgcctcg gtgagttttc tccttcatta cagaaacggc tttttcaaaa atatggtatt 3181gataatcctg atatgaataa attgcagttt catttgatgc tcgatgagtt tttctaatca 3241gaattggtta attggttgta acactggcag agcattacgc tgacttgacg ggacggcgca 3301agctcatgac caaaatccct taacgtgagt tacgcgtcgt tccactgagc gtcagacccc 3361gtagaaaaga 
tcaaaggatc ttcttgagat cctttttttc tgcgcgtaat ctgctgcttg 3421caaacaaaaa aaccaccgct accagcggtg gtttgtttgc cggatcaaga gctaccaact 3481ctttttccga aggtaactgg cttcagcaga gcgcagatac caaatactgt ccttctagtg 3541tagccgtagt taggccacca cttcaagaac tctgtagcac cgcctacata cctcgctctg 3601ctaatcctgt taccagtggc tgctgccagt ggcgataagt cgtgtcttac cgggttggac 3661tcaagacgat agttaccgga taaggcgcag cggtcgggct gaacgggggg ttcgtgcaca 3721cagcccagct tggagcgaac gacctacacc gaactgagat acctacagcg tgagcattga 3781gaaagcgcca cgcttcccga agggagaaag gcggacaggt atccggtaag cggcagggtc 3841ggaacaggag agcgcacgag ggagcttcca gggggaaacg cctggtatct ttatagtcct 3901gtcgggtttc gccacctctg acttgagcgt cgatttttgt gatgctcgtc aggggggcgg 3961agcctatgga aaaacgccag caacgcggcc tttttacggt tcctggcctt ttgctggcct 4021tttgctcaca tgtt pENTR221 Truncated TAL FokI Vector SequenceSEQ ID NO: 31    1ctttcctgcg ttatcccctg attctgtgga taaccgtatt accgcctttg agtgagctga   61taccgctcgc cgcagccgaa cgaccgagcg cagcgagtca gtgagcgagg aagcggaaga  121gcgcccaata cgcaaaccgc ctctccccgc gcgttggccg attcattaat gcagctggca  181cgacaggttt cccgactgga aagcgggcag tgagcgcaac gcaattaata cgcgtaccgc  241tagccaggaa gagtttgtag aaacgcaaaa aggccatccg tcaggatggc cttctgctta  301gtttgatgcc tggcagttta tggcgggcgt cctgcccgcc accctccggg ccgttgcttc  361acaacgttca aatccgctcc cggcggattt gtcctactca ggagagcgtt caccgacaaa  421caacagataa aacgaaaggc ccagtcttcc gactgagcct ttcgttttat ttgatgcctg  481gcagttccct actctcgcgt taacgctagc atggatgttt tcccagtcac gacgttgtaa  541aacgacggcc agtcttaagc tcgggcccca aataatgatt ttattttgac tgatagtgac  601ctgttcgttg caacaaattg atgagcaatg cttttttata atgccaactt tgtacaaaaa  661agcaggctgc ggccgcgcca ccatgggaaa acctattcct aatcctctgc tgggcctgga  721ttctaccgga ggcatggccc ctaagaaaaa gcggaaggtg gacggcggag tggacctgag  781aacactggga tattctcagc agcagcagga gaagatcaag cccaaggtga gatctacagt  841ggcccagcac cacgaagccc tggtgggaca cggatttaca cacgcccaca ttgtggccct  901gtctcagcac cctgccgccc tgggaacagt ggccgtgaaa tatcaggata tgattgccgc  961cctgcctgag gccacacacg aagccattgt gggagtggga 
aaacagtggt ctggagccag 1021agccctggaa gccctgctga cagtggccgg agaactgaga ggacctcctc tgcagctgga 1081tacaggacag ctgctgaaga ttgccaaaag gggcggagtg accgcggtgg aagccgtgca 1141cgcctggaga aatgccctga caggagcccc tctgaaccct tgcaggtgcc ggaattgcca 1201gctggggcgc cctctggtaa ggttgggaag ccctgcaaag taaactggat ggctttcttg 1261ccgccaagga tctgatggcg caggggatca agctctgatc aagagacagg atgaggatcg 1321tttcgcatgc agttcaaagt gtatacctac aaacgtgaaa gccgttatcg tctgtttgtg 1381gatgtgcaga gcgatattat tgataccccg ggtcgtcgta tggtgattcc gctggcctct 1441gcgcgtctgc tgtctgataa agtgagccgt gagctgtatc cggtggtgca tattggtgat 1501gaaagctggc gtatgatgac caccgatatg gcgagcgtgc cggtgagcgt gattggcgaa 1561gaagtggcgg atctgagcca tcgtgaaaac gatatcaaaa acgcgattaa cctgatgttt 1621tggggcattt aataaatgtc aggctccctt atacacagcc agtctgcagt cacctgcgga 1681tagcattgtg gcccagctgt ctagacctga tcctgccctg gccgccctga caaatgatca 1741cctggtggcc ctggcctgtc tgggaggcag acctgccctg gatgccgtga aaaaaggact 1801gcctcacgcc cctgccctga tcaagagaac aaatagaaga atccccgagc ggacctctca 1861cagagtggcc ggatcccagc tggtgaaatc tgagctggag gagaagaagt ctgagctgag 1921acacaagctg aagtacgtgc ctcacgagta catcgagctg atcgagatcg ccagaaatag 1981cacccaggat agaatcctgg agatgaaggt gatggagttc ttcatgaagg tgtacggcta 2041cagaggaaag cacctgggag gaagcagaaa acctgacgga gccatttata cagtgggcag 2101ccctatcgat tatggcgtga tcgtggatac aaaggcctac agcggaggct acaatctgcc 2161tattggacag gccgatgaga tgcagagata cgtggaggag aaccagacca ggaacaagca 2221catcaaccct aacgagtggt ggaaggtgta cccttctagc gtgaccgagt tcaagttcct 2281gtttgtgagc ggccacttca agggcaatta taaggcccag ctgaccaggc tgaaccacat 2341cacaaattgt aatggcgccg tgctgtctgt ggaggaactg ctgattggag gagagatgat 2401taaggccgga acactgacac tggaggaggt gagaagaaag ttcaacaacg gcgagatcaa 2461cttctgaaag cttacccagc tttcttgtac aaagttggca ttataagaaa gcattgctta 2521tcaatttgtt gcaacgaaca ggtcactatc agtcaaaata aaatcattat ttgccatcca 2581gctgatatcc cctatagtga gtcgtattac atggtcatag ctgtttcctg gcagctctgg 2641cccgtgtctc aaaatctctg atgttacatt gcacaagata aaaatatatc atcatgaaca 2701ataaaactgt 
ctgcttacat aaacagtaat acaaggggtg ttatgagcca tattcaacgg 2761gaaacgtcga ggccgcgatt aaattccaac atggatgctg atttatatgg gtataaatgg 2821gctcgcgata atgtcgggca atcaggtgcg acaatctatc gcttgtatgg gaagcccgat 2881gcgccagagt tgtttctgaa acatggcaaa ggtagcgttg ccaatgatgt tacagatgag 2941atggtcagac taaactggct gacggaattt atgcctcttc cgaccatcaa gcattttatc 3001cgtactcctg atgatgcatg gttactcacc actgcgatcc ccggaaaaac agcattccag 3061gtattagaag aatatcctga ttcaggtgaa aatattgttg atgcgctggc agtgttcctg 3121cgccggttgc attcgattcc tgtttgtaat tgtcctttta acagcgatcg cgtatttcgt 3181ctcgctcagg cgcaatcacg aatgaataac ggtttggttg atgcgagtga ttttgatgac 3241gagcgtaatg gctggcctgt tgaacaagtc tggaaagaaa tgcataaact tttgccattc 3301tcaccggatt cagtcgtcac tcatggtgat ttctcacttg ataaccttat ttttgacgag 3361gggaaattaa taggttgtat tgatgttgga cgagtcggaa tcgcagaccg ataccaggat 3421cttgccatcc tatggaactg cctcggtgag ttttctcctt cattacagaa acggcttttt 3481caaaaatatg gtattgataa tcctgatatg aataaattgc agtttcattt gatgctcgat 3541gagtttttct aatcagaatt ggttaattgg ttgtaacact ggcagagcat tacgctgact 3601tgacgggacg gcgcaagctc atgaccaaaa tcccttaacg tgagttacgc gtcgttccac 3661tgagcgtcag accccgtaga aaagatcaaa ggatcttctt gagatccttt ttttctgcgc 3721gtaatctgct gcttgcaaac aaaaaaacca ccgctaccag cggtggtttg tttgccggat 3781caagagctac caactctttt tccgaaggta actggcttca gcagagcgca gataccaaat 3841actgtccttc tagtgtagcc gtagttaggc caccacttca agaactctgt agcaccgcct 3901acatacctcg ctctgctaat cctgttacca gtggctgctg ccagtggcga taagtcgtgt 3961cttaccgggt tggactcaag acgatagtta ccggataagg cgcagcggtc gggctgaacg 4021gggggttcgt gcacacagcc cagcttggag cgaacgacct acaccgaact gagataccta 4081cagcgtgagc attgagaaag cgccacgctt cccgaaggga gaaaggcgga caggtatccg 4141gtaagcggca gggtcggaac aggagagcgc acgagggagc ttccaggggg aaacgcctgg 4201tatctttata gtcctgtcgg gtttcgccac ctctgacttg agcgtcgatt tttgtgatgc 4261tcgtcagggg ggcggagcct atggaaaaac gccagcaacg cggccttttt acggttcctg 4321gccttttgct ggccttttgc tcacatgttpENTR221 Native TAL VP16 Activator Vector Sequence SEQ ID NO: 32    1ctttcctgcg 
ttatcccctg attctgtgga taaccgtatt accgcctttg agtgagctga   61taccgctcgc cgcagccgaa cgaccgagcg cagcgagtca gtgagcgagg aagcggaaga  121gcgcccaata cgcaaaccgc ctctccccgc gcgttggccg attcattaat gcagctggca  181cgacaggttt cccgactgga aagcgggcag tgagcgcaac gcaattaata cgcgtaccgc  241tagccaggaa gagtttgtag aaacgcaaaa aggccatccg tcaggatggc cttctgctta  301gtttgatgcc tggcagttta tggcgggcgt cctgcccgcc accctccggg ccgttgcttc  361acaacgttca aatccgctcc cggcggattt gtcctactca ggagagcgtt caccgacaaa  421caacagataa aacgaaaggc ccagtcttcc gactgagcct ttcgttttat ttgatgcctg  481gcagttccct actctcgcgt taacgctagc atggatgttt tcccagtcac gacgttgtaa  541aacgacggcc agtcttaagc tcgggcccca aataatgatt ttattttgac tgatagtgac  601ctgttcgttg caacaaattg atgagcaatg cttttttata atgccaactt tgtacaaaaa  661agcaggctgc ggccgcgcca ccatgggaaa acctattcct aatcctctgc tgggcctgga  721ttctaccatg gaccctatta gaagcagaac accctctcca gccagagaac tgctgtctgg  781acctcagcct gatggagtgc agcctacagc cgatagagga gtgtctcctc ctgccggagg  841acctctggat ggactgcctg cccggagaac aatgagcaga acaagactgc cttctcctcc  901agccccatct cctgcctttt ctgccgattc ttttagcgac ctgctgagac agtttgaccc  961cagcctgttt aataccagcc tgttcgatag cctgcctcct tttggagccc accacacaga 1021ggccgccaca ggcgaatggg atgaagtgca gtctggactg agagccgccg atgcccctcc 1081tcctacaatg agagtggccg tgacagccgc cagacctcct agagccaaac ctgcccctag 1141aaggagagcc gcccagcctt ctgatgcctc tcctgccgcc caggtggacc tgagaacact 1201gggatattct cagcagcagc aggagaagat caagcccaag gtgaggtcta cagtggccca 1261gcaccacgaa gccctggtgg gacacggatt tacacacgcc cacattgtgg ccctgtctca 1321gcaccctgcc gccctgggaa cagtggccgt gaaatatcag gatatgattg ccgccctgcc 1381tgaggccaca cacgaagcca ttgtgggagt gggaaaacag tggtctggag ccagagccct 1441ggaagccctg ctgacagtgg ccggagaact gagaggacct cctctgcagc tggatacagg 1501acagctgctg aagattgcca aaaggggcgg agtgaccgcg gtggaagccg tgcacgcctg 1561gagaaatgcc ctgacaggag cccctctgaa cccttgcagg tgccggaatt gccagctggg 1621gcgccctctg gtaaggttgg gaagccctgc aaagtaaact ggatggcttt cttgccgcca 1681aggatctgat ggcgcagggg atcaagctct gatcaagaga 
caggatgagg atcgtttcgc 1741atgcagttca aagtgtatac ctacaaacgt gaaagccgtt atcgtctgtt tgtggatgtg 1801cagagcgata ttattgatac ccctggtcgt cgtatggtga ttccgctggc ctctgcgcgt 1861ctgctgtctg ataaagtgag ccgtgagctg tatccggtgg tgcatattgg tgatgaaagc 1921tggcgtatga tgaccaccga tatggcgagc gtgccggtga gcgtgattgg cgaagaagtg 1981gcggatctga gccatcgtga aaacgatatc aaaaacgcga ttaacctgat gttttggggc 2041atttaataaa tgtcaggctc ccttatacac agccagtctg cagtcacctg cggatagcat 2101tgtggcccag ctgtctagac ctgatcctgc cctggccgcc ctgacaaatg atcacctggt 2161ggccctggcc tgtctgggag gcagacctgc cctggatgcc gtgaaaaaag gactgcctca 2221cgcccctgcc ctgatcaaga gaacaaatag aagaatcccc gagcggacct ctcacagagt 2281ggccgatcac gcccaggtgg tgagagtgct gggatttttt cagtgtcact ctcaccctgc 2341ccaggccttt gatgatgcca tgacacagtt tggcatgagc agacacggac tgctgcagct 2401gtttagaaga gtgggagtga cagaactgga ggccagaagc ggaacactgc ctccagcctc 2461tcagagatgg gatagaattc tgcaggccag cggaatgaag agagccaaac cttctcctac 2521cagcacccag acacctgatc aggccagcct gcacgccttt gccgattctc tggaacggga 2581tctggacgcc ccttctccta tgcacgaagg agatcagaca agagccagca gcagaaagag 2641aagcaggtct gatagagccg tgacaggacc ttctgcccag cagtcttttg aagtgagagt 2701gcctgaacag agagatgccc tgcatctgcc tctgctgtct tggggagtga aaagacctag 2761aacaagaatc ggaggactgc tggaccctgg aacacctatg gatgccgatc tggtggcctc 2821ttctacagtg gtgtgggaac aggatgccga tccttttgcc ggaacagccg atgatttccc 2881tgcctttaat gaggaagaac tggcctggct gatggaactg ctgcctcagg gatccgcccc 2941tcctacagat gtgtctctgg gagatgagct ccacctggat ggagaagatg tggccatggc 3001ccacgccgat gccctggatg attttgatct ggatatgctg ggagatggcg attctcctgg 3061acctggattt acacctcacg attctgcccc ttatggagcc ctggatatgg ccgattttga 3121gttcgagcag atgttcacag atgccctggg catcgacgag tatggcggct gaaagcttac 3181ccagctttct tgtacaaagt tggcattata agaaagcatt gcttatcaat ttgttgcaac 3241gaacaggtca ctatcagtca aaataaaatc attatttgcc atccagctga tatcccctat 3301agtgagtcgt attacatggt catagctgtt tcctggcagc tctggcccgt gtctcaaaat 3361ctctgatgtt acattgcaca agataaaaat atatcatcat gaacaataaa actgtctgct 3421tacataaaca 
gtaatacaag gggtgttatg agccatattc aacgggaaac gtcgaggccg 3481cgattaaatt ccaacatgga tgctgattta tatgggtata aatgggctcg cgataatgtc 3541gggcaatcag gtgcgacaat ctatcgcttg tatgggaagc ccgatgcgcc agagttgttt 3601ctgaaacatg gcaaaggtag cgttgccaat gatgttacag atgagatggt cagactaaac 3661tggctgacgg aatttatgcc tcttccgacc atcaagcatt ttatccgtac tcctgatgat 3721gcatggttac tcaccactgc gatccccgga aaaacagcat tccaggtatt agaagaatat 3781cctgattcag gtgaaaatat tgttgatgcg ctggcagtgt tcctgcgccg gttgcattcg 3841attcctgttt gtaattgtcc ttttaacagc gatcgcgtat ttcgtctcgc tcaggcgcaa 3901tcacgaatga ataacggttt ggttgatgcg agtgattttg atgacgagcg taatggctgg 3961cctgttgaac aagtctggaa agaaatgcat aaacttttgc cattctcacc ggattcagtc 4021gtcactcatg gtgatttctc acttgataac cttatttttg acgaggggaa attaataggt 4081tgtattgatg ttggacgagt cggaatcgca gaccgatacc aggatcttgc catcctatgg 4141aactgcctcg gtgagttttc tccttcatta cagaaacggc tttttcaaaa atatggtatt 4201gataatcctg atatgaataa attgcagttt catttgatgc tcgatgagtt tttctaatca 4261gaattggtta attggttgta acactggcag agcattacgc tgacttgacg ggacggcgca 4321agctcatgac caaaatccct taacgtgagt tacgcgtcgt tccactgagc gtcagacccc 4381gtagaaaaga tcaaaggatc ttcttgagat cctttttttc tgcgcgtaat ctgctgcttg 4441caaacaaaaa aaccaccgct accagcggtg gtttgtttgc cggatcaaga gctaccaact 4501ctttttccga aggtaactgg cttcagcaga gcgcagatac caaatactgt ccttctagtg 4561tagccgtagt taggccacca cttcaagaac tctgtagcac cgcctacata cctcgctctg 4621ctaatcctgt taccagtggc tgctgccagt ggcgataagt cgtgtcttac cgggttggac 4681tcaagacgat agttaccgga taaggcgcag cggtcgggct gaacgggggg ttcgtgcaca 4741cagcccagct tggagcgaac gacctacacc gaactgagat acctacagcg tgagcattga 4801gaaagcgcca cgcttcccga agggagaaag gcggacaggt atccggtaag cggcagggtc 4861ggaacaggag agcgcacgag ggagcttcca gggggaaacg cctggtatct ttatagtcct 4921gtcgggtttc gccacctctg acttgagcgt cgatttttgt gatgctcgtc aggggggcgg 4981agcctatgga aaaacgccag caacgcggcc tttttacggt tcctggcctt ttgctggcct 5041tttgctcaca tgtt pENTR221 Native TAL MCS Vector Sequence SEQ ID NO: 33   1 ctttcctgcg ttatcccctg attctgtgga taaccgtatt 
accgcctttg agtgagctga  61 taccgctcgc cgcagccgaa cgaccgagcg cagcgagtca gtgagcgagg aagcggaaga 121 gcgcccaata cgcaaaccgc ctctccccgc gcgttggccg attcattaat gcagctggca 181 cgacaggttt cccgactgga aagcgggcag tgagcgcaac gcaattaata cgcgtaccgc 241 tagccaggaa gagtttgtag aaacgcaaaa aggccatccg tcaggatggc cttctgctta 301 gtttgatgcc tggcagttta tggcgggcgt cctgcccgcc accctccggg ccgttgcttc 361 acaacgttca aatccgctcc cggcggattt gtcctactca ggagagcgtt caccgacaaa 421 caacagataa aacgaaaggc ccagtcttcc gactgagcct ttcgttttat ttgatgcctg 481 gcagttccct actctcgcgt taacgctagc atggatgttt tcccagtcac gacgttgtaa 541 aacgacggcc agtcttaagc tcgggcccca aataatgatt ttattttgac tgatagtgac 601 ctgttcgttg caacaaattg atgagcaatg cttttttata atgccaactt tgtacaaaaa 661 agcaggctgc ggccgcgcca ccatgggaaa acctattcct aatcctctgc tgggcctgga 721 ttctaccatg gaccctatta gaagcagaac accctctcca gccagagaac tgctgtctgg 781 acctcagcct gatggagtgc agcctacagc cgatagagga gtgtctcctc ctgccggagg 841 acctctggat ggactgcctg cccggagaac aatgagcaga acaagactgc cttctcctcc 901 agccccatct cctgcctttt ctgccgattc ttttagcgac ctgctgagac agtttgaccc 961 cagcctgttt aataccagcc tgttcgatag cctgcctcct tttggagccc accacacaga1021 ggccgccaca ggcgaatggg atgaagtgca gtctggactg agagccgccg atgcccctcc1081 tcctacaatg agagtggccg tgacagccgc cagacctcct agagccaaac ctgcccctag1141 aaggagagcc gcccagcctt ctgatgcctc tcctgccgcc caggtggacc tgagaacact1201 gggatattct cagcagcagc aggagaagat caagcccaag gtgaggtcta cagtggccca1261 gcaccacgaa gccctggtgg gacacggatt tacacacgcc cacattgtgg ccctgtctca1321 gcaccctgcc gccctgggaa cagtggccgt gaaatatcag gatatgattg ccgccctgcc1381 tgaggccaca cacgaagcca ttgtgggagt gggaaaacag tggtctggag ccagagccct1441 ggaagccctg ctgacagtgg ccggagaact gagaggacct cctctgcagc tggatacagg1501 acagctgctg aagattgcca aaaggggcgg agtgaccgcg gtggaagccg tgcacgcctg1561 gagaaatgcc ctgacaggag cccctctgaa cccttgcagg tgccggaatt gccagctggg1621 gcgccctctg gtaaggttgg gaagccctgc aaagtaaact ggatggcttt cttgccgcca1681 aggatctgat ggcgcagggg atcaagctct gatcaagaga caggatgagg atcgtttcgc1741 atgcagttca 
aagtgtatac ctacaaacgt gaaagccgtt atcgtctgtt tgtggatgtg1801 cagagcgata ttattgatac ccctggtcgt cgtatggtga ttccgctggc ctctgcgcgt1861 ctgctgtctg ataaagtgag ccgtgagctg tatccggtgg tgcatattgg tgatgaaagc1921 tggcgtatga tgaccaccga tatggcgagc gtgccggtga gcgtgattgg cgaagaagtg1981 gcggatctga gccatcgtga aaacgatatc aaaaacgcga ttaacctgat gttttggggc2041 atttaataaa tgtcaggctc ccttatacac agccagtctg cagtcacctg cggatagcat2101 tgtggcccag ctgtctagac ctgatcctgc cctggccgcc ctgacaaatg atcacctggt2161 ggccctggcc tgtctgggag gcagacctgc cctggatgcc gtgaaaaaag gactgcctca2221 cgcccctgcc ctgatcaaga gaacaaatag aagaatcccc gagcggacct ctcacagagt2281 ggccgatcac gcccaggtgg tgagagtgct gggatttttt cagtgtcact ctcaccctgc2341 ccaggccttt gatgatgcca tgacacagtt tggcatgagc agacacggac tgctgcagct2401 gtttagaaga gtgggagtga cagaactgga ggccagaagc ggaacactgc ctccagcctc2461 tcagagatgg gatagaatcc tgcaggccag cggaatgaag agagccaaac cttctcctac2521 cagcacccag acacctgatc aggccagcct gcacgccttt gccgattctc tggaaaggga2581 tctggacgcc ccttctccta tgcacgaagg agatcagaca agagccagca gcagaaagag2641 aagcaggtct gatagagccg tgacaggacc ttctgcccag cagtcttttg aagtgagagt2701 gcctgaacag agagatgccc tgcatctgcc tctgctgtct tggggagtga aaagacctag2761 aacaagaatc ggaggactgc tggaccccgg gacacctatg gatgccgatc tggtggcctc2821 ttctacagtg gtgtgggaac aggatgccga tccttttgcc ggaacagccg atgatttccc2881 tgcctttaat gaggaagaac tggcctggct gatggaactg ctgcctcagg gttcccgttt2941 aaacaagctt gtcgacggta ccgaattcat cgatagtact ctcgagggat ccgagctcaa3001 gatctagcta agtagaccca gctttcttgt acaaagttgg cattataaga aagcattgct3061 tatcaatttg ttgcaacgaa caggtcacta tcagtcaaaa taaaatcatt atttgccatc3121 cagctgatat cccctatagt gagtcgtatt acatggtcat agctgtttcc tggcagctct3181 ggcccgtgtc tcaaaatctc tgatgttaca ttgcacaaga taaaaatata tcatcatgaa3241 caataaaact gtctgcttac ataaacagta atacaagggg tgttatgagc catattcaac3301 gggaaacgtc gaggccgcga ttaaattcca acatggatgc tgatttatat gggtataaat3361 gggctcgcga taatgtcggg caatcaggtg cgacaatcta tcgcttgtat gggaagcccg3421 atgcgccaga gttgtttctg aaacatggca aaggtagcgt 
tgccaatgat gttacagatg3481 agatggtcag actaaactgg ctgacggaat ttatgcctct tccgaccatc aagcatttta3541 tccgtactcc tgatgatgca tggttactca ccactgcgat ccccggaaaa acagcattcc3601 aggtattaga agaatatcct gattcaggtg aaaatattgt tgatgcgctg gcagtgttcc3661 tgcgccggtt gcattcgatt cctgtttgta attgtccttt taacagcgat cgcgtatttc3721 gtctcgctca ggcgcaatca cgaatgaata acggtttggt tgatgcgagt gattttgatg3781 acgagcgtaa tggctggcct gttgaacaag tctggaaaga aatgcataaa cttttgccat3841 tctcaccgga ttcagtcgtc actcatggtg atttctcact tgataacctt atttttgacg3901 aggggaaatt aataggttgt attgatgttg gacgagtcgg aatcgcagac cgataccagg3961 atcttgccat cctatggaac tgcctcggtg agttttctcc ttcattacag aaacggcttt4021 ttcaaaaata tggtattgat aatcctgata tgaataaatt gcagtttcat ttgatgctcg4081 atgagttttt ctaatcagaa ttggttaatt ggttgtaaca ctggcagagc attacgctga4141 cttgacggga cggcgcaagc tcatgaccaa aatcccttaa cgtgagttac gcgtcgttcc4201 actgagcgtc agaccccgta gaaaagatca aaggatcttc ttgagatcct ttttttctgc4261 gcgtaatctg ctgcttgcaa acaaaaaaac caccgctacc agcggtggtt tgtttgccgg4321 atcaagagct accaactctt tttccgaagg taactggctt cagcagagcg cagataccaa4381 atactgtcct tctagtgtag ccgtagttag gccaccactt caagaactct gtagcaccgc4441 ctacatacct cgctctgcta atcctgttac cagtggctgc tgccagtggc gataagtcgt4501 gtcttaccgg gttggactca agacgatagt taccggataa ggcgcagcgg tcgggctgaa4561 cggggggttc gtgcacacag cccagcttgg agcgaacgac ctacaccgaa ctgagatacc4621 tacagcgtga gcattgagaa agcgccacgc ttcccgaagg gagaaaggcg gacaggtatc4681 cggtaagcgg cagggtcgga acaggagagc gcacgaggga gcttccaggg ggaaacgcct4741 ggtatcttta tagtcctgtc gggtttcgcc acctctgact tgagcgtcga tttttgtgat4801 gctcgtcagg ggggcggagc ctatggaaaa acgccagcaa cgcggccttt ttacggttcc4861 tggccttttg ctggcctttt gctcacatgt tpENTR221 Native TAL FokI Vector Sequence SEQ ID NO: 34    1ctttcctgcg ttatcccctg attctgtgga taaccgtatt accgcctttg agtgagctga   61taccgctcgc cgcagccgaa cgaccgagcg cagcgagtca gtgagcgagg aagcggaaga  121gcgcccaata cgcaaaccgc ctctccccgc gcgttggccg attcattaat gcagctggca  181cgacaggttt cccgactgga aagcgggcag tgagcgcaac gcaattaata 
cgcgtaccgc  241tagccaggaa gagtttgtag aaacgcaaaa aggccatccg tcaggatggc cttctgctta  301gtttgatgcc tggcagttta tggcgggcgt cctgcccgcc accctccggg ccgttgcttc  361acaacgttca aatccgctcc cggcggattt gtcctactca ggagagcgtt caccgacaaa  421caacagataa aacgaaaggc ccagtcttcc gactgagcct ttcgttttat ttgatgcctg  481gcagttccct actctcgcgt taacgctagc atggatgttt tcccagtcac gacgttgtaa  541aacgacggcc agtcttaagc tcgggcccca aataatgatt ttattttgac tgatagtgac  601ctgttcgttg caacaaattg atgagcaatg cttttttata atgccaactt tgtacaaaaa  661agcaggctgc ggccgcgcca ccatgggaaa acctattcct aatcctctgc tgggcctgga  721ttctaccatg gaccctatta gaagcagaac accctctcca gccagagaac tgctgtctgg  781acctcagcct gatggagtgc agcctacagc cgatagagga gtgtctcctc ctgccggagg  841acctctggat ggactgcctg cccggagaac aatgagcaga acaagactgc cttctcctcc  901agccccatct cctgcctttt ctgccgattc ttttagcgac ctgctgagac agtttgaccc  961cagcctgttt aataccagcc tgttcgatag cctgcctcct tttggagccc accacacaga 1021ggccgccaca ggcgaatggg atgaagtgca gtctggactg agagccgccg atgcccctcc 1081tcctacaatg agagtggccg tgacagccgc cagacctcct agagccaaac ctgcccctag 1141aaggagagcc gcccagcctt ctgatgcctc tcctgccgcc caggtggacc tgagaacact 1201gggatattct cagcagcagc aggagaagat caagcccaag gtgaggtcta cagtggccca 1261gcaccacgaa gccctggtgg gacacggatt tacacacgcc cacattgtgg ccctgtctca 1321gcaccctgcc gccctgggaa cagtggccgt gaaatatcag gatatgattg ccgccctgcc 1381tgaggccaca cacgaagcca ttgtgggagt gggaaaacag tggtctggag ccagagccct 1441ggaagccctg ctgacagtgg ccggagaact gagaggacct cctctgcagc tggatacagg 1501acagctgctg aagattgcca aaaggggcgg agtgaccgcg gtggaagccg tgcacgcctg 1561gagaaatgcc ctgacaggag cccctctgaa cccttgcagg tgccggaatt gccagctggg 1621gcgccctctg gtaaggttgg gaagccctgc aaagtaaact ggatggcttt cttgccgcca 1681aggatctgat ggcgcagggg atcaagctct gatcaagaga caggatgagg atcgtttcgc 1741atgcagttca aagtgtatac ctacaaacgt gaaagccgtt atcgtctgtt tgtggatgtg 1801cagagcgata ttattgatac ccctggtcgt cgtatggtga ttccgctggc ctctgcgcgt 1861ctgctgtctg ataaagtgag ccgtgagctg tatccggtgg tgcatattgg tgatgaaagc 1921tggcgtatga tgaccaccga 
tatggcgagc gtgccggtga gcgtgattgg cgaagaagtg 1981gcggatctga gccatcgtga aaacgatatc aaaaacgcga ttaacctgat gttttggggc 2041atttaataaa tgtcaggctc ccttatacac agccagtctg cagtcacctg cggatagcat 2101tgtggcccag ctgtctagac ctgatcctgc cctggccgcc ctgacaaatg atcacctggt 2161ggccctggcc tgtctgggag gcagacctgc cctggatgcc gtgaaaaaag gactgcctca 2221cgcccctgcc ctgatcaaga gaacaaatag aagaatcccc gagcggacct ctcacagagt 2281ggccgatcac gcccaggtgg tgagagtgct gggatttttt cagtgtcact ctcaccctgc 2341ccaggccttt gatgatgcca tgacacagtt tggcatgagc agacacggac tgctgcagct 2401gtttagaaga gtgggagtga cagaactgga ggccagaagc ggaacactgc ctccagcctc 2461tcagagatgg gatagaattc tgcaggccag cggaatgaag agagccaaac cttctcctac 2521cagcacccag acacctgatc aggccagcct gcacgccttt gccgattctc tggaacggga 2581tctggacgcc ccttctccta tgcacgaagg agatcagaca agagccagca gcagaaagag 2641aagcaggtct gatagagccg tgacaggacc ttctgcccag cagtcttttg aagtgagagt 2701gcctgaacag agagatgccc tgcatctgcc tctgctgtct tggggagtga aaagacctag 2761aacaagaatc ggaggactgc tggaccctgg aacacctatg gatgccgatc tggtggcctc 2821ttctacagtg gtgtgggaac aggatgccga tccttttgcc ggaacagccg atgatttccc 2881tgcctttaat gaggaagaac tggcctggct gatggaactg ctgcctcagg gatcccagct 2941ggtgaaatct gagctggagg agaagaagtc tgagctgaga cacaagctga agtacgtgcc 3001tcacgagtac atcgagctga tcgagatcgc cagaaatagc acccaggata gaatcctgga 3061gatgaaggtg atggagttct tcatgaaggt gtacggctac agaggaaagc acctgggagg 3121aagcagaaaa cctgacggag ccatttatac agtgggcagc cctatcgatt atggcgtgat 3181cgtggataca aaggcctaca gcggaggcta caatctgcct attggacagg ccgatgagat 3241gcagagatac gtggaggaga accagaccag gaacaagcac atcaacccta acgagtggtg 3301gaaggtgtac ccttctagcg tgaccgagtt caagttcctg tttgtgagcg gccacttcaa 3361gggcaattat aaggcccagc tgaccaggct gaaccacatc acaaattgta atggcgccgt 3421gctgtctgtg gaggaactgc tgattggagg agagatgatt aaggccggaa cactgacact 3481ggaggaggtg agaagaaagt tcaacaacgg cgagatcaac ttctgaaagc ttacccagct 3541ttcttgtaca aagttggcat tataagaaag cattgcttat caatttgttg caacgaacag 3601gtcactatca gtcaaaataa aatcattatt tgccatccag ctgatatccc 
ctatagtgag 3661tcgtattaca tggtcatagc tgtttcctgg cagctctggc ccgtgtctca aaatctctga 3721tgttacattg cacaagataa aaatatatca tcatgaacaa taaaactgtc tgcttacata 3781aacagtaata caaggggtgt tatgagccat attcaacggg aaacgtcgag gccgcgatta 3841aattccaaca tggatgctga tttatatggg tataaatggg ctcgcgataa tgtcgggcaa 3901tcaggtgcga caatctatcg cttgtatggg aagcccgatg cgccagagtt gtttctgaaa 3961catggcaaag gtagcgttgc caatgatgtt acagatgaga tggtcagact aaactggctg 4021acggaattta tgcctcttcc gaccatcaag cattttatcc gtactcctga tgatgcatgg 4081ttactcacca ctgcgatccc cggaaaaaca gcattccagg tattagaaga atatcctgat 4141tcaggtgaaa atattgttga tgcgctggca gtgttcctgc gccggttgca ttcgattcct 4201gtttgtaatt gtccttttaa cagcgatcgc gtatttcgtc tcgctcaggc gcaatcacga 4261atgaataacg gtttggttga tgcgagtgat tttgatgacg agcgtaatgg ctggcctgtt 4321gaacaagtct ggaaagaaat gcataaactt ttgccattct caccggattc agtcgtcact 4381catggtgatt tctcacttga taaccttatt tttgacgagg ggaaattaat aggttgtatt 4441gatgttggac gagtcggaat cgcagaccga taccaggatc ttgccatcct atggaactgc 4501ctcggtgagt tttctccttc attacagaaa cggctttttc aaaaatatgg tattgataat 4561cctgatatga ataaattgca gtttcatttg atgctcgatg agtttttcta atcagaattg 4621gttaattggt tgtaacactg gcagagcatt acgctgactt gacgggacgg cgcaagctca 4681tgaccaaaat cccttaacgt gagttacgcg tcgttccact gagcgtcaga ccccgtagaa 4741aagatcaaag gatcttcttg agatcctttt tttctgcgcg taatctgctg cttgcaaaca 4801aaaaaaccac cgctaccagc ggtggtttgt ttgccggatc aagagctacc aactcttttt 4861ccgaaggtaa ctggcttcag cagagcgcag ataccaaata ctgtccttct agtgtagccg 4921tagttaggcc accacttcaa gaactctgta gcaccgccta catacctcgc tctgctaatc 4981ctgttaccag tggctgctgc cagtggcgat aagtcgtgtc ttaccgggtt ggactcaaga 5041cgatagttac cggataaggc gcagcggtcg ggctgaacgg ggggttcgtg cacacagccc 5101agcttggagc gaacgaccta caccgaactg agatacctac agcgtgagca ttgagaaagc 5161gccacgcttc ccgaagggag aaaggcggac aggtatccgg taagcggcag ggtcggaaca 5221ggagagcgca cgagggagct tccaggggga aacgcctggt atctttatag tcctgtcggg 5281tttcgccacc tctgacttga gcgtcgattt ttgtgatgct cgtcaggggg gcggagccta 5341tggaaaaacg ccagcaacgc 
ggccttttta cggttcctgg ccttttgctg gccttttgct 5401cacatgtt pENTR221 Native TAL VP64 Activator Vector SequenceSEQ ID NO: 35    1ctttcctgcg ttatcccctg attctgtgga taaccgtatt accgcctttg agtgagctga   61taccgctcgc cgcagccgaa cgaccgagcg cagcgagtca gtgagcgagg aagcggaaga  121gcgcccaata cgcaaaccgc ctctccccgc gcgttggccg attcattaat gcagctggca  181cgacaggttt cccgactgga aagcgggcag tgagcgcaac gcaattaata cgcgtaccgc  241tagccaggaa gagtttgtag aaacgcaaaa aggccatccg tcaggatggc cttctgctta  301gtttgatgcc tggcagttta tggcgggcgt cctgcccgcc accctccggg ccgttgcttc  361acaacgttca aatccgctcc cggcggattt gtcctactca ggagagcgtt caccgacaaa  421caacagataa aacgaaaggc ccagtcttcc gactgagcct ttcgttttat ttgatgcctg  481gcagttccct actctcgcgt taacgctagc atggatgttt tcccagtcac gacgttgtaa  541aacgacggcc agtcttaagc tcgggcccca aataatgatt ttattttgac tgatagtgac  601ctgttcgttg caacaaattg atgagcaatg cttttttata atgccaactt tgtacaaaaa  661agcaggctgc ggccgcgcca ccatgggaaa acctattcct aatcctctgc tgggcctgga  721ttctaccatg gaccctatta gaagcagaac accctctcca gccagagaac tgctgtctgg  781acctcagcct gatggagtgc agcctacagc cgatagagga gtgtctcctc ctgccggagg  841acctctggat ggactgcctg cccggagaac aatgagcaga acaagactgc cttctcctcc  901agccccatct cctgcctttt ctgccgattc ttttagcgac ctgctgagac agtttgaccc  961cagcctgttt aataccagcc tgttcgatag cctgcctcct tttggagccc accacacaga 1021ggccgccaca ggcgaatggg atgaagtgca gtctggactg agagccgccg atgcccctcc 1081tcctacaatg agagtggccg tgacagccgc cagacctcct agagccaaac ctgcccctag 1141aaggagagcc gcccagcctt ctgatgcctc tcctgccgcc caggtggacc tgagaacact 1201gggatattct cagcagcagc aggagaagat caagcccaag gtgaggtcta cagtggccca 1261gcaccacgaa gccctggtgg gacacggatt tacacacgcc cacattgtgg ccctgtctca 1321gcaccctgcc gccctgggaa cagtggccgt gaaatatcag gatatgattg ccgccctgcc 1381tgaggccaca cacgaagcca ttgtgggagt gggaaaacag tggtctggag ccagagccct 1441ggaagccctg ctgacagtgg ccggagaact gagaggacct cctctgcagc tggatacagg 1501acagctgctg aagattgcca aaaggggcgg agtgaccgcg gtggaagccg tgcacgcctg 1561gagaaatgcc ctgacaggag cccctctgaa cccttgcagg tgccggaatt 
gccagctggg 1621gcgccctctg gtaaggttgg gaagccctgc aaagtaaact ggatggcttt cttgccgcca 1681aggatctgat ggcgcagggg atcaagctct gatcaagaga caggatgagg atcgtttcgc 1741atgcagttca aagtgtatac ctacaaacgt gaaagccgtt atcgtctgtt tgtggatgtg 1801cagagcgata ttattgatac ccctggtcgt cgtatggtga ttccgctggc ctctgcgcgt 1861ctgctgtctg ataaagtgag ccgtgagctg tatccggtgg tgcatattgg tgatgaaagc 1921tggcgtatga tgaccaccga tatggcgagc gtgccggtga gcgtgattgg cgaagaagtg 1981gcggatctga gccatcgtga aaacgatatc aaaaacgcga ttaacctgat gttttggggc 2041atttaataaa tgtcaggctc ccttatacac agccagtctg cagtcacctg cggatagcat 2101tgtggcccag ctgtctagac ctgatcctgc cctggccgcc ctgacaaatg atcacctggt 2161ggccctggcc tgtctgggag gcagacctgc cctggatgcc gtgaaaaaag gactgcctca 2221cgcccctgcc ctgatcaaga gaacaaatag aagaatcccc gagcggacct ctcacagagt 2281ggccgatcac gcccaggtgg tgagagtgct gggatttttt cagtgtcact ctcaccctgc 2341ccaggccttt gatgatgcca tgacacagtt tggcatgagc agacacggac tgctgcagct 2401gtttagaaga gtgggagtga cagaactgga ggccagaagc ggaacactgc ctccagcctc 2461tcagagatgg gatagaattc tgcaggccag cggaatgaag agagccaaac cttctcctac 2521cagcacccag acacctgatc aggccagcct gcacgccttt gccgattctc tggaacggga 2581tctggacgcc ccttctccta tgcacgaagg agatcagaca agagccagca gcagaaagag 2641aagcaggtct gatagagccg tgacaggacc ttctgcccag cagtcttttg aagtgagagt 2701gcctgaacag agagatgccc tgcatctgcc tctgctgtct tggggagtga aaagacctag 2761aacaagaatc ggaggactgc tggaccctgg aacacctatg gatgccgatc tggtggcctc 2821ttctacagtg gtgtgggaac aggatgccga tccttttgcc ggaacagccg atgatttccc 2881tgcctttaat gaggaagaac tggcctggct gatggaactg ctgcctcagg gatcccctaa 2941gaaaaagcgg aaagtggaag cctctggatc tggcagagcc gatgccctgg atgattttga 3001tctggatatg ctgggaagcg acgccctgga tgatttcgat ctggatatgc tgggatctga 3061cgccctggat gatttcgatc tggatatgct gggatctgac gccctggatg atttcgatct 3121ggacatgctg atcaacagct gaaagcttac ccagctttct tgtacaaagt tggcattata 3181agaaagcatt gcttatcaat ttgttgcaac gaacaggtca ctatcagtca aaataaaatc 3241attatttgcc atccagctga tatcccctat agtgagtcgt attacatggt catagctgtt 3301tcctggcagc tctggcccgt 
gtctcaaaat ctctgatgtt acattgcaca agataaaaat 3361atatcatcat gaacaataaa actgtctgct tacataaaca gtaatacaag gggtgttatg 3421agccatattc aacgggaaac gtcgaggccg cgattaaatt ccaacatgga tgctgattta 3481tatgggtata aatgggctcg cgataatgtc gggcaatcag gtgcgacaat ctatcgcttg 3541tatgggaagc ccgatgcgcc agagttgttt ctgaaacatg gcaaaggtag cgttgccaat 3601gatgttacag atgagatggt cagactaaac tggctgacgg aatttatgcc tcttccgacc 3661atcaagcatt ttatccgtac tcctgatgat gcatggttac tcaccactgc gatccccgga 3721aaaacagcat tccaggtatt agaagaatat cctgattcag gtgaaaatat tgttgatgcg 3781ctggcagtgt tcctgcgccg gttgcattcg attcctgttt gtaattgtcc ttttaacagc 3841gatcgcgtat ttcgtctcgc tcaggcgcaa tcacgaatga ataacggttt ggttgatgcg 3901agtgattttg atgacgagcg taatggctgg cctgttgaac aagtctggaa agaaatgcat 3961aaacttttgc cattctcacc ggattcagtc gtcactcatg gtgatttctc acttgataac 4021cttatttttg acgaggggaa attaataggt tgtattgatg ttggacgagt cggaatcgca 4081gaccgatacc aggatcttgc catcctatgg aactgcctcg gtgagttttc tccttcatta 4141cagaaacggc tttttcaaaa atatggtatt gataatcctg atatgaataa attgcagttt 4201catttgatgc tcgatgagtt tttctaatca gaattggtta attggttgta acactggcag 4261agcattacgc tgacttgacg ggacggcgca agctcatgac caaaatccct taacgtgagt 4321tacgcgtcgt tccactgagc gtcagacccc gtagaaaaga tcaaaggatc ttcttgagat 4381cctttttttc tgcgcgtaat ctgctgcttg caaacaaaaa aaccaccgct accagcggtg 4441gtttgtttgc cggatcaaga gctaccaact ctttttccga aggtaactgg cttcagcaga 4501gcgcagatac caaatactgt ccttctagtg tagccgtagt taggccacca cttcaagaac 4561tctgtagcac cgcctacata cctcgctctg ctaatcctgt taccagtggc tgctgccagt 4621ggcgataagt cgtgtcttac cgggttggac tcaagacgat agttaccgga taaggcgcag 4681cggtcgggct gaacgggggg ttcgtgcaca cagcccagct tggagcgaac gacctacacc 4741gaactgagat acctacagcg tgagcattga gaaagcgcca cgcttcccga agggagaaag 4801gcggacaggt atccggtaag cggcagggtc ggaacaggag agcgcacgag ggagcttcca 4861gggggaaacg cctggtatct ttatagtcct gtcgggtttc gccacctctg acttgagcgt 4921cgatttttgt gatgctcgtc aggggggcgg agcctatgga aaaacgccag caacgcggcc 4981tttttacggt tcctggcctt ttgctggcct tttgctcaca tgttpENTR221 
Native TAL Repressor Vector Sequence SEQ ID NO: 36    1CTTTCCTGCGTTATC CCCTGATTCTGTGGA TAACCGTATTACCGC CTTTGAGTGAGCTGA TACCGCTCGCCGCAG  76CCGAACGACCGAGCG CAGCGAGTCAGTGAG CGAGGAAGCGGAAGA GCGCCCAATACGCAA ACCGCCTCTCCCCGC 151GCGTTGGCCGATTCA TTAATGCAGCTGGCA CGACAGGTTTCCCGA CTGGAAAGCGGGCAG TGAGCGCAACGCAAT 226TAATACGCGTACCGC TAGCCAGGAAGAGTT TGTAGAAACGCAAAA AGGCCATCCGTCAGG ATGGCCTTCTGCTTA 301GTTTGATGCCTGGCA GTTTATGGCGGGCGT CCTGCCCGCCACCCT CCGGGCCGTTGCTTC ACAACGTTCAAATCC 376GCTCCCGGCGGATTT GTCCTACTCAGGAGA GCGTTCACCGACAAA CAACAGATAAAACGA AAGGCCCAGTCTTCC 451GACTGAGCCTTTCGT TTTATTTGATGCCTG GCAGTTCCCTACTCT CGCGTTAACGCTAGC ATGGATGTTTTCCCA 526GTCACGACGTTGTAA AACGACGGCCAGTCT TAAGCTCGGGCCCCA AATAATGATTTTATT TTGACTGATAGTGAC 601CTGTTCGTTGCAACA AATTGATGAGCAATG CTTTTTTATAATGCC AACTTTGTACAAAAA AGCAGGCTGCGGCCG 676CGCCACCATGGGAAA ACCTATTCCTAATCC TCTGCTGGGCCTGGA TTCTACCGACCCTAT TAGAAGCAGAACACC 751TTCTCCAGCCAGAGA GCTGCTGTCTGGACC TCAGCCTGATGGAGT GCAGCCTACAGCCGA TAGAGGAGTGTCTCC 826TCCTGCCGGAGGACC TCTGGATGGACTGCC TGCCCGGAGAACAAT GAGCAGAACAAGACT GCCTTCTCCTCCAGC 901CCCATCTCCTGCCTT TTCTGCCGATTCTTT TAGCGACCTGCTGAG ACAGTTTGACCCCAG CCTGTTTAATACCAG 976CCTGTTCGATAGCCT GCCTCCTTTTGGAGC CCACCACACAGAGGC CGCCACAGGCGAATG GGATGAAGTGCAGTC1051TGGACTGAGAGCCGC CGATGCCCCTCCTCC TACAATGAGAGTGGC CGTGACAGCCGCCAG ACCTCCTAGAGCCAA1126ACCTGCCCCTAGAAG GAGAGCCGCCCAGCC TTCTGATGCCTCTCC TGCCGCCCAGGTGGA TCTGAGAACACTGGG1201ATATTCTCAGCAGCA GCAGGAGAAGATCAA GCCCAAGGTGAGATC TACAGTGGCCCAGCA CCACGAAGCCCTGGT1276GGGACACGGATTTAC ACACGCCCACATTGT GGCCCTGTCTCAGCA CCCTGCCGCCCTGGG AACAGTGGCCGTGAA1351ATATCAGGATATGAT TGCCGCCCTGCCTGA GGCCACACACGAAGC CATTGTGGGAGTGGG AAAACAGTGGTCTGG1426AGCCAGAGCCCTGGA AGCCCTGCTGACAGT GGCCGGAGAACTGAG AGGACCTCCTCTGCA GCTGGATACAGGACA1501GCTGCTGAAGATTGC CAAAAGGGGCGGAGT GACCGCGGTGGAAGC CGTGCACGCCTGGAG AAATGCCCTGACAGG1576AGCCCCTCTGAACCC TTGCAGGTGCCGGAA TTGCCAGCTGGGGCG CCCTCTGGTAAGGTT GGGAAGCCCTGCAAA1651GTAAACTGGATGGCT TTCTTGCCGCCAAGG ATCTGATGGCGCAGG GGATCAAGCTCTGAT CAAGAGACAGGATGA1726GGATCGTTTCGCATG CAGTTCAAAGTGTAT 
ACCTACAAACGTGAA AGCCGTTATCGTCTG TTTGTGGATGTGCAG1801AGCGATATTATTGAT ACCCCGGGTCGTCGT ATGGTGATTCCGCTG GCCTCTGCGCGTCTG CTGTCTGATAAAGTG1876AGCCGTGAGCTGTAT CCGGTGGTGCATATT GGTGATGAAAGCTGG CGTATGATGACCACC GATATGGCGAGCGTG1951CCGGTGAGCGTGATT GGCGAAGAAGTGGCG GATCTGAGCCATCGT GAAAACGATATCAAA AACGCGATTAACCTG2026ATGTTTTGGGGCATT TAATAAATGTCAGGC TCCCTTATACACAGC CAGTCTGCAGTCACC TGCGGATAGCATTGT2101GGCCCAGCTGTCTAG ACCTGATCCTGCCCT GGCCGCCCTGACAAA TGATCACCTGGTGGC CCTGGCCTGTCTGGG2176AGGCAGACCTGCCCT GGATGCCGTGAAAAA AGGACTGCCTCACGC CCCTGCCCTGATCAA GAGAACAAATAGAAG2251AATCCCCGAGCGGAC CTCTCACAGAGTGGC CGATCACGCCCAGGT GGTGAGAGTGCTGGG ATTTTTTCAGTGTCA2326CTCTCACCCTGCCCA GGCCTTTGATGATGC CATGACACAGTTTGG CATGAGCAGACACGG ACTGCTGCAGCTGTT2401TAGAAGAGTGGGAGT GACAGAACTGGAGGC CAGATCTGGTACCCT GCCTCCTGCCTCTCA GAGATGGGATAGAAT2476TCTGCAGCTGAAGCT GCTGTCTAGCATTGA ACAGGCCTGCCCCAA GAAGAAGAGAAAAGT GGACGACGCCAAGAG2551CCTGACAGCCTGGAG CAGAACACTGGTGAC ATTCAAGGATGTGTT CGTGGACTTCACCAG GGAGGAATGGAAACT2626GCTGGATACAGCCCA GCAGATCGTGTACAG AAATGTGATGCTGGA GAACTACAAGAACCT GGTGTCTCTGGGCTA2701CCAGCTGACAAAGCC TGATGTGATTCTGAG ACTGGAGAAGGGCGA AGAACCTTGGCTGGT GGAAAGAGAGATCCA2776CTGAAAGCTTACCCA GCTTTCTTGTACAAA GTTGGCATTATAAGA AAGCATTGCTTATCA ATTTGTTGCAACGAA2851CAGGTCACTATCAGT CAAAATAAAATCATT ATTTGCCATCCAGCT GATATCCCCTATAGT GAGTCGTATTACATG2926GTCATAGCTGTTTCC TGGCAGCTCTGGCCC GTGTCTCAAAATCTC TGATGTTACATTGCA CAAGATAAAAATATA3001TCATCATGAACAATA AAACTGTCTGCTTAC ATAAACAGTAATACA AGGGGTGTTATGAGC CATATTCAACGGGAA3076ACGTCGAGGCCGCGA TTAAATTCCAACATG GATGCTGATTTATAT GGGTATAAATGGGCT CGCGATAATGTCGGG3151CAATCAGGTGCGACA ATCTATCGCTTGTAT GGGAAGCCCGATGCG CCAGAGTTGTTTCTG AAACATGGCAAAGGT3226AGCGTTGCCAATGAT GTTACAGATGAGATG GTCAGACTAAACTGG CTGACGGAATTTATG CCTCTTCCGACCATC3301AAGCATTTTATCCGT ACTCCTGATGATGCA TGGTTACTCACCACT GCGATCCCCGGAAAA ACAGCATTCCAGGTA3376TTAGAAGAATATCCT GATTCAGGTGAAAAT ATTGTTGATGCGCTG GCAGTGTTCCTGCGC CGGTTGCATTCGATT3451CCTGTTTGTAATTGT CCTTTTAACAGCGAT CGCGTATTTCGTCTC GCTCAGGCGCAATCA CGAATGAATAACGGT3526TTGGTTGATGCGAGT GATTTTGATGACGAG 
CGTAATGGCTGGCCT GTTGAACAAGTCTGG AAAGAAATGCATAAA3601CTTTTGCCATTCTCA CCGGATTCAGTCGTC ACTCATGGTGATTTC TCACTTGATAACCTT ATTTTTGACGAGGGG3676AAATTAATAGGTTGT ATTGATGTTGGACGA GTCGGAATCGCAGAC CGATACCAGGATCTT GCCATCCTATGGAAC3751TGCCTCGGTGAGTTT TCTCCTTCATTACAG AAACGGCTTTTTCAA AAATATGGTATTGAT AATCCTGATATGAAT3826AAATTGCAGTTTCAT TTGATGCTCGATGAG TTTTTCTAATCAGAA TTGGTTAATTGGTTG TAACACTGGCAGAGC3901ATTACGCTGACTTGA CGGGACGGCGCAAGC TCATGACCAAAATCC CTTAACGTGAGTTAC GCGTCGTTCCACTGA3976GCGTCAGACCCCGTA GAAAAGATCAAAGGA TCTTCTTGAGATCCT TTTTTTCTGCGCGTA ATCTGCTGCTTGCAA4051ACAAAAAAACCACCG CTACCAGCGGTGGTT TGTTTGCCGGATCAA GAGCTACCAACTCTT TTTCCGAAGGTAACT4126GGCTTCAGCAGAGCG CAGATACCAAATACT GTCCTTCTAGTGTAG CCGTAGTTAGGCCAC CACTTCAAGAACTCT4201GTAGCACCGCCTACA TACCTCGCTCTGCTA ATCCTGTTACCAGTG GCTGCTGCCAGTGGC GATAAGTCGTGTCTT4276ACCGGGTTGGACTCA AGACGATAGTTACCG GATAAGGCGCAGCGG TCGGGCTGAACGGGG GGTTCGTGCACACAG4351CCCAGCTTGGAGCGA ACGACCTACACCGAA CTGAGATACCTACAG CGTGAGCATTGAGAA AGCGCCACGCTTCCC4426GAAGGGAGAAAGGCG GACAGGTATCCGGTA AGCGGCAGGGTCGGA ACAGGAGAGCGCACG AGGGAGCTTCCAGGG4501GGAAACGCCTGGTAT CTTTATAGTCCTGTC GGGTTTCGCCACCTC TGACTTGAGCGTCGA TTTTTGTGATGCTCG4576TCAGGGGGGCGGAGC CTATGGAAAAACGCC AGCAACGCGGCCTTT TTACGGTTCCTGGCC TTTTGCTGGCCTTTT4651 GCTCACATG

EXAMPLES

Example 1: Placing an Order with a Service Provider Offering TAL-Related Services Via a Web-Based Platform

This example describes one embodiment of a possible workflow underlying a customer order related to TAL-specific services. The order process begins with a customer inquiry or request. The request may be received directly via the portal or may be received otherwise, such as, e.g., by email, via the supplier's webpage, by phone or fax, etc. The customer will be asked to create an account (if the customer is not already registered with the supplier) to be able to log in to the order portal of the service provider. After login, the customer encounters a TAL designer interface where the customer can choose from different options (see FIG. 1A). The customer is asked to enter a minimal TAL target sequence (e.g., a 24 base nucleotide sequence) and select a destination host from a drop-down menu. In this example, the customer wants to order a TAL expression construct for expression in human cells and will therefore select “human” as destination host. The selection of a destination host may influence the parameters chosen for gene optimization, the parts used for manufacturing (e.g., selected from a material repository) and the assembly strategy. Next, the customer is asked to select an effector domain from a drop-down menu. Alternatively, the customer can enter or paste the amino acid sequence of a desired effector domain in a “customized effector” field. The customer chooses a nuclease domain from the menu and additionally pastes the amino acid sequences of two mutated variants of the nuclease domain (which are not offered in the drop-down menu) into the customized effector field. As the selected nuclease domains only function as dimers, the system automatically generates a note that a pair of TAL effector nuclease constructs will be generated. In a next step, the customer chooses a target vector from one or more drop-down menus. The customer can choose, e.g., a cloning vector or an expression vector or both. In this example, the customer wishes the different TAL effector nuclease constructs to be subcloned into a pENTR GATEWAY® functional vector and selects a respective vector with a kanamycin resistance gene. Furthermore, the customer selects specific enzyme cleavage sites for the 5′ and 3′ ends of the genes and selects a His-Tag to be fused to the 3′-end of each construct for purification purposes. In a next step, the customer can choose from additional services and orders a TAL binding proof for the specified TAL effector nuclease. Finally, the customer requests a quote for manufacture of a stable TAL effector nuclease cell line where the cell line would be provided by the customer and adds material specifications in a comment box. To finalize the inquiry, the customer presses a submit button and the specifications are analysed and processed by the service provider.
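The order specification collected in this workflow could be represented as a simple record that is checked before submission. The following Python sketch is illustrative only: the class name, field names and the `effector_is_obligate_dimer` flag are hypothetical and are not part of any actual order portal. It shows how a minimal target length (24 bases in this example) and the automatically generated note for dimeric nuclease domains might be handled.

```python
from dataclasses import dataclass, field
from typing import List

VALID_BASES = set("ACGT")


@dataclass
class TalOrder:
    """Hypothetical record of the options selected in the TAL designer interface."""
    target_sequence: str
    destination_host: str
    effector_domain: str
    effector_is_obligate_dimer: bool = False  # assumption: supplied by the effector catalogue
    custom_effectors: List[str] = field(default_factory=list)
    target_vector: str = ""
    tag: str = ""
    additional_services: List[str] = field(default_factory=list)
    min_target_length: int = 24  # the example mentions a 24 base target sequence

    def validate(self) -> List[str]:
        """Check the target sequence and return notes to display to the customer."""
        seq = self.target_sequence.upper()
        if set(seq) - VALID_BASES:
            raise ValueError("target sequence may only contain A, C, G and T")
        if len(seq) < self.min_target_length:
            raise ValueError("target sequence is shorter than the minimal length")
        notes = []
        if self.effector_is_obligate_dimer:
            # Nuclease domains that act as dimers require two TAL constructs.
            notes.append("A pair of TAL effector nuclease constructs will be generated.")
        return notes


order = TalOrder(
    target_sequence="ACGTACGTACGTACGTACGTACGT",  # placeholder 24-base target
    destination_host="human",
    effector_domain="nuclease",
    effector_is_obligate_dimer=True,
    target_vector="pENTR (kanamycin resistance)",
    tag="His-Tag (3' end)",
    additional_services=["TAL binding proof"],
)
notes = order.validate()
```

Keeping the dimer requirement as an explicit flag, rather than inferring it from the domain name, is a design choice of this sketch.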

Example 2a: Development of a TAL Binding Plate Assay

To validate that engineered TAL effectors are capable of binding to their predicted target sites, we developed a plate binding assay. For this purpose, TAL effectors targeting the Hax3 DNA binding box were cloned into a pDEST17 Gateway® vector containing a T7 promoter and placing a His tag at the N terminus of the proteins. TAL effectors were expressed using a rabbit reticulocyte in vitro transcription/translation system, the TNT® Quick Coupled Transcription/Translation System (Life Technologies Corp., Carlsbad). The expressed TAL effectors were then captured on nickel-coated 96-well plates, Pierce Nickel Coated Plates (Pierce Biotechnology, Rockford, Ill.). Plates were washed with a buffer (30 mM KCl, 0.1 mM DTT, 0.1 mM EDTA, 10 mM Tris, pH 7.4) to remove unbound components and were then incubated with a TAL DNA binding probe in binding buffer (30 mM KCl, 0.1 mM DTT, 0.1 mM EDTA, 10 mM Tris, pH 7.4). To generate the binding probe, DNA oligonucleotides containing the DNA binding sequence and between 5 and 10 extra nucleotides at each end were synthesized through Life Technologies (Carlsbad, Calif.). Binding probes were then generated by annealing two complementary DNA oligonucleotides in a thermal cycler. After incubation, unbound probe was removed by washing the plates. Next, the DNA bound to TAL effectors was labeled using QUANT-IT™ PICOGREEN® dsDNA reagent. After another washing step, the fluorescence of the samples was measured at excitation 480 nm, emission 520 nm using a spectrofluorometer. In the example illustrated in FIG. 4A, specificity for a specific TAL protein (containing Hax3 binding repeats) was determined in the presence of increasing amounts of two different binding probes containing Hax3 and ArtX1 (negative control) binding sites. The results of this experiment demonstrate that the developed plate binding assay is a quick and reliable method to validate binding capacity of engineered TAL effectors to a given target sequence.

Example 2b: Development of a Bead-Based Binding Assay

As an alternative, a bead-based assay for quantitative analysis of TAL binding was established as illustrated in FIG. 4B. In this assay format, His-tagged TAL proteins expressed in a cell free system are covalently coupled to magnetic Ni-beads and incubated with target DNA. Beads are then washed and subjected to quantitative PCR. In the illustrated example, equal amounts of TAL protein were covalently coupled to activated DYNABEADS® and incubated with a 5-fold molar excess (Sample 1) and equimolar amount (Sample 2) of plasmid target DNA. Total bound DNA was then quantified by qPCR with plasmid specific primers and SYBR® Green. FIG. 4C shows another approach to determine TAL binding specificity (e.g., of Hax3) to target DNA by means of a gel-shift assay in the presence of increasing amounts of the expressed TAL protein.

Example 3: Design-Based Synthesis of TAL Effector Expression Constructs

The following is a protocol regarding how TAL effector molecules can be designed and manufactured.

Sequence design and optimization: A synthetic version of each TAL cassette and effector fusion wild-type sequence was generated reflecting the codon bias of the target organism (e.g., the codon usage of mammalian cells, bacteria, yeast, microalgae). Starting with a wild-type sequence, the target organism codon-preference as specified in a codon usage table (CUT, http://www.kazusa.or.jp/codon/) was transferred to the primary sequence based on the degeneracy of the genetic code. Basically, the amino acid sequence was back-translated into a nucleic acid sequence by exchanging codons with target organism-preferred codons wherever possible. For this purpose the GENEOPTIMIZER® software tool was used. This multi-parameter optimization tool fulfils two functions: first, the software optimizes the nucleic acid sequence for a specific purpose taking multiple parameters into account, such as codon usage, GC content, repetitive sequences that may interfere with production, RNA secondary structure, cryptic splice sites or other inhibitory motifs or sequences relevant for specific hosts. In certain instances, it may be beneficial to harmonize the GC content if this results in more favourable production conditions or more balanced protein folding. Second, the oligonucleotide sequences for gene synthesis are determined, which means that the full-length nucleic acid molecule that is to be synthesized will be fragmented into smaller subfragments that will again be assembled from oligonucleotides. For trimer and higher order assembly, TAL cassettes were designed as follows: 44 cassettes (monomers) were synthesized reflecting 11 discrete cassette positions within the final TAL effector for four nucleotides each.
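The back-translation step described above can be illustrated with a minimal Python sketch that selects, for each amino acid, the codon with the highest frequency in a codon usage table. This is not the GENEOPTIMIZER® algorithm, which weighs several parameters at once; it only shows the codon-preference substitution. The table `TOY_CUT` and its frequency values are placeholders invented for illustration and cover only a few amino acids; a real table would be taken from a codon usage database for the target organism.

```python
# Placeholder codon usage table: amino acid -> {codon: relative frequency}.
# Values are illustrative only and do not represent any organism.
TOY_CUT = {
    "M": {"ATG": 1.00},
    "G": {"GGC": 0.40, "GGA": 0.30, "GGG": 0.20, "GGT": 0.10},
    "K": {"AAG": 0.60, "AAA": 0.40},
    "L": {"CTG": 0.50, "CTC": 0.20, "TTG": 0.15, "CTT": 0.15},
}


def preferred_codon(amino_acid: str, cut: dict) -> str:
    """Return the most frequent codon for one amino acid."""
    try:
        codons = cut[amino_acid]
    except KeyError:
        raise ValueError(f"no codon usage entry for amino acid {amino_acid!r}")
    return max(codons, key=codons.get)


def back_translate(protein: str, cut: dict) -> str:
    """Back-translate a protein sequence using preferred codons only."""
    return "".join(preferred_codon(aa, cut) for aa in protein.upper())


dna = back_translate("MGKL", TOY_CUT)
```

A sequence produced this way would still need to be checked for GC content, repeats, secondary structure and inhibitory motifs, as listed in the paragraph above.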

Synthesis of cassette vectors: The in silico designed coding sequences of 44 cassettes reflecting 11 discrete cassette positions in a TAL effector for 4 nucleotide binding categories were broken down into overlapping oligonucleotides. The sense strand sequence was split into three sequential L-oligos of 50-60 nucleotides (nt) in length to cover the complete nucleic acid sequence without gaps. Likewise, the antisense strand was split into two shorter M-oligos of approximately 40 nt in length partially overlapping the corresponding, complementary L-oligos. For a second amplification step and in preparation for cloning, two terminal oligos, pf (primer forward) and pb (primer backward), were designed. These terminal oligos should provide a 25 nt overlap with the sequence and an additional 12 nt protruding sequence containing homologous regions with the destination vector for subsequent cloning. The designed oligonucleotides were then produced by conventional oligonucleotide synthesis procedures. The synthetic cassettes were then generated via stepwise PCR amplification, as follows: In a first amplification round referred to as SCR (Sequential Chain Reaction), for each fragment 5 μl of an oligo-pool containing all L- and M-oligos at a final concentration of 15 nM, 18 μl H₂O and 27 μl PCR master-mix were mixed together and subjected to PCR using the protocol as outlined in TABLE 21. The SCR product cannot be used for cloning directly, but has to be further amplified using a method referred to as SPCR (Sequential PCR) to introduce the homologous regions with the terminal pf and pb primers. Seven μl of the SCR reaction were mixed with 2 μl each of pf and pb (at a concentration of 15 μM, each), 27 μl PCR master-mix and 14 μl H₂O, and were subjected to PCR using the protocol as specified in TABLE 22. The product of this PCR reaction was analysed on an agarose gel. The destination vector and the synthetic cassettes were then subjected to an exonuclease based reaction as described in Aslanidis and de Jong (Ligation-independent cloning of PCR products (LIC-PCR); Nucleic Acids Research, Vol. 18, No. 20, 6069 (1990)) to generate single stranded overhangs for subsequent ligation-independent cloning. The annealed product was directly transformed into E. coli and correct clones were selected on kanamycin LB medium.
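The oligonucleotide layout described for the cassettes (three gap-free sense-strand L-oligos bridged by two antisense M-oligos of about 40 nt) can be sketched in Python as follows. The function names, the choice to center each M-oligo on an L-oligo junction, and the demonstration sequence are assumptions made for illustration; the actual design rules used for the cassettes may differ.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def split_sense_strand(seq: str, n_oligos: int = 3) -> list:
    """Split the sense strand into contiguous L-oligos without gaps."""
    base, extra = divmod(len(seq), n_oligos)
    oligos, start = [], 0
    for i in range(n_oligos):
        size = base + (1 if i < extra else 0)
        oligos.append(seq[start:start + size])
        start += size
    return oligos


def bridging_antisense_oligos(seq: str, l_oligos: list, m_len: int = 40) -> list:
    """Design one antisense M-oligo centered on each L-oligo junction (assumption)."""
    m_oligos, junction = [], 0
    for oligo in l_oligos[:-1]:
        junction += len(oligo)
        start = junction - m_len // 2
        m_oligos.append(reverse_complement(seq[start:start + m_len]))
    return m_oligos


# Placeholder 165 nt sequence; not an actual TAL cassette.
demo_seq = ("ACGTTGCA" * 21)[:165]
l_oligos = split_sense_strand(demo_seq)
m_oligos = bridging_antisense_oligos(demo_seq, l_oligos)
```

With a 165 nt input, each L-oligo is 55 nt long, which falls within the 50-60 nt range given above, and each M-oligo anneals across one junction between adjacent L-oligos.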

Generation of trimer library: The resulting 44 cassette vectors (together a cassette library) were then used to generate a trimer library (see FIG. 7A). In order to produce each required trimer for each defined position, three TAL cassettes were assembled to generate one trimer according to the type IIS assembly protocol following a library approach (according to the example, 8 libraries were produced). For the assembly of each three cassettes into one vector, a 50 μl total volume approach was used. Two hundred ng of the target vector (containing ccdB as counter-selectable marker gene) and 50 ng of each cassette vector (a total of 12 TAL monomers) were mixed with 2 μl restriction enzyme BsaI and incubated for 1 hour at 37° C. Following addition of 1 μl buffer, 3 μl 10 mM ATP, 1 μl T4 ligase and 5 μl H₂O, the mix was incubated for 1 hour at 22° C. followed by an additional digestion step for 1 hour at 37° C. (no additional restriction enzyme added). The enzyme was inactivated for 10 minutes at 65° C. Two to five μl of the reaction sample were directly used for transformation of competent E. coli cells. Finally, a sufficient number of colonies was sequenced to ensure that every individual combination (64 per library) was recovered. In this example the final trimer library consisted of 512 trimer clones (8 libraries of 4³ randomly assembled cassettes). The half cassettes (half repeats) are reflected by position 11 and allow for synthesis of TAL effectors with 17.5 and 23.5 cassettes. The trimer library was then used as a basis for all higher order TAL effector assembly projects, which allows building a puzzle from fewer, larger pieces, thereby maximizing assembly efficiency.
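The library sizes stated above follow from simple counting, which the Python sketch below reproduces: 4 nucleotide specificities at each of 3 positions give 4³ = 64 combinations per library, and 8 libraries give 512 trimer clones. The sketch also estimates how many colonies might need to be sequenced to recover all 64 combinations of one library; that estimate assumes every combination assembles with equal probability, an assumption that is not stated in this example and may not hold in practice.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def trimer_combinations() -> list:
    """Enumerate every nucleotide specificity combination for one trimer."""
    return ["".join(combo) for combo in product(NUCLEOTIDES, repeat=3)]


def total_trimer_clones(n_libraries: int = 8) -> int:
    """Total number of distinct trimer clones across all libraries."""
    return n_libraries * len(trimer_combinations())


def expected_colonies_for_full_coverage(n_combinations: int) -> float:
    """Expected number of random colonies needed to see every combination
    at least once (coupon collector estimate, equal probabilities assumed)."""
    return n_combinations * sum(1.0 / k for k in range(1, n_combinations + 1))


combos = trimer_combinations()
clones = total_trimer_clones()
estimate = expected_colonies_for_full_coverage(len(combos))
```

Under the equal-probability assumption the estimate is on the order of a few hundred colonies per library; biased assembly would increase this number.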

Assembly of TAL effector fusions: To build a TAL effector with 24 cassettes with desired binding specificity, trimer vectors were selected from the library for each position using a design tool described in more detail elsewhere herein and were assembled following a 2-step type IIS assembly method as described above (FIG. 7B). The desired TAL effector sequence was split into two sub-parts of similar length (two sets of 12-mers). Each sub-part was assembled into one capture vector using the 4 respective trimers. In a second step, the two sub-parts of 12 cassettes each were assembled into a functional vector comprising the N- and C-terminal flanking sequences and a 3′ located effector fusion (a FokI nuclease cleavage domain). In this example a Gateway entry clone was used as functional vector, which may serve for further cloning and recombination.
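The decomposition of a 24-cassette TAL effector into two 12-mer sub-parts of four trimers each can be sketched in Python as shown below. The function name and its parameters are hypothetical and are not the design tool referred to in this example; the sketch only shows how a target sequence maps onto the trimers that would be picked from the library for each sub-part.

```python
def plan_two_step_assembly(target: str, cassette_count: int = 24,
                           trimer_len: int = 3, subparts: int = 2) -> list:
    """Split a target sequence into sub-parts, each given as a list of trimers.

    Returns one list of trimer specificities per capture vector.
    """
    target = target.upper()
    if len(target) != cassette_count:
        raise ValueError(f"target must contain exactly {cassette_count} bases")
    if set(target) - set("ACGT"):
        raise ValueError("target may only contain A, C, G and T")
    per_subpart = cassette_count // subparts
    plan = []
    for s in range(subparts):
        chunk = target[s * per_subpart:(s + 1) * per_subpart]
        # Each sub-part is covered by consecutive, non-overlapping trimers.
        plan.append([chunk[i:i + trimer_len] for i in range(0, per_subpart, trimer_len)])
    return plan


# Placeholder target sequence for illustration.
plan = plan_two_step_assembly("ACGTACGTACGTACGTACGTACGT")
```

Each inner list corresponds to the four trimers cloned into one capture vector in the first step; the two capture vector inserts are then combined in the functional vector in the second step.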

TABLE 21 PCR protocol 1

Cycle  Number  Step  ° C.     Time
1      1       1     95       04:00
2      30      1     95       00:30
               2     60→40*   00:30
               3     72       01:00
3      1       1     72       04:00
4      1       1     4        ∞

*Use a touchdown program starting with 60° C. and ending with 40° C.
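The touchdown annealing step of TABLE 21 can be expressed as a per-cycle temperature schedule. The Python sketch below assumes that the annealing temperature decreases linearly from 60° C. to 40° C. over the 30 cycles; the table gives only the start and end temperatures, so the linear decrement is an assumption, and a given thermal cycler may implement touchdown differently.

```python
def touchdown_schedule(start_c: float = 60.0, end_c: float = 40.0,
                       cycles: int = 30) -> list:
    """Return the annealing temperature for each cycle, assuming a linear decrease."""
    if cycles < 2:
        raise ValueError("a touchdown program needs at least two cycles")
    step = (start_c - end_c) / (cycles - 1)
    return [round(start_c - i * step, 2) for i in range(cycles)]


schedule = touchdown_schedule()
```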

TABLE 22 PCR protocol 2

Cycle  Number  Step  ° C.  Time
1      1       1     95    04:00
2      30      1     95    00:30
               2     58    00:30
               3     72    01:00
3      1       1     72    04:00
4      1       1     4     ∞

Example 3a: TAL Effector Fusion Assembly According to a First Protocol

To clone four trimers (=12 cassettes) into each capture vector (step (i) assembly), two parallel assembly reactions were prepared on day 1. For each reaction, 50 ng of each of the four selected trimer vectors were mixed with 200 ng of the capture vector and 40 Units (2 μl) of type IIS restriction enzyme BsaI (New England Biolabs (NEB), Ipswich, Mass.) and incubated for 1 hour at 37° C. in a 20 μl reaction volume containing 2 μl of NEB4 buffer. Following addition of 1 μl buffer NEB4, 3 μl 10 mM ATP, 400 Units (1 μl) T4 ligase (NEB) and 5 μl H₂O, the reaction mixtures were incubated for 1 hour at 22° C. to allow for ligation of assembled capture vectors carrying 12 cassettes each, followed by an optional additional digestion step for 1 hour at 37° C. (no additional restriction enzyme added). The enzymes were inactivated for 10 minutes at 65° C. before 5 μl of each reaction mixture was transformed separately into chemically competent E. coli, which were plated overnight on selective media. On day 2, 8 cfu per assembled capture vector were screened for correct insert size by cPCR (a PCR reaction in the presence of two primers binding in the vector backbone next to the TAL repeat subsets in opposite directions), and 15 ml of LB-medium with spectinomycin (50 μg/ml final conc.) were inoculated with selected cfu and grown overnight at 37° C. On day 3, overnight cultures were harvested and plasmids were prepared using the PureYield™ Plasmid Midiprep System from Promega (Madison, Wis.) according to the manufacturer's instructions, yielding ~100 μg plasmid DNA from 15-ml cultures.

Sequence verification of the assembled TAL repeat subsets was performed on an ABI Sequencer 3730 using primers binding next to the TAL repeat subsets in the vector backbones. On day 4, step (ii) assembly was performed to clone TAL repeat subsets (12 cassettes from each capture vector) into a functional vector containing the TAL N- and C-terminal domains. For this purpose, 50 ng of each purified and sequence-verified capture vector and 200 ng of the functional vector were mixed with 4 Units (2 μl) of type IIS restriction enzyme AarI (Fermentas, Hanover, Md.) in the presence of 0.5 μM of oligonucleotides (as recommended by Fermentas) and incubated for 1 hour at 37° C. in a 20 μl reaction volume containing 2 μl of NEB4 buffer.

Following addition of 1 μl buffer NEB4, 3 μl 10 mM ATP, 1 μl T4 ligase (NEB) and 5 μl H₂O, the reaction mixture was incubated for 1 hour at 22° C. to allow for ligation of assembled functional vector carrying 24 cassettes, followed by an additional digestion step for 1 hour at 37° C. The enzymes were inactivated for 10 minutes at 65° C. before 5 μl of each reaction mixture was transformed into chemically competent E. coli, which were plated overnight on selective media. On day 5, the same procedure as on day 2 was performed, including cPCR of 8 cfu of functional vectors followed by inoculation of 15 ml LB-medium overnight cultures. On day 6, overnight cultures were harvested and plasmids were prepared according to the same protocol as outlined for day 3, resulting in ~100 μg amounts of purified functional vector. Finally, the full-length TAL effector fusion was subjected to sequencing as described for day 3 above in the presence of additional primers binding to specific TAL repeats (as described in detail elsewhere herein).

Example 3b: TAL Effector Fusion Assembly According to a Second Protocol

On day 1, step (i) assembly of four trimers into each capture vector was performed as described for day 1 of Example 3a and 5 μl of each assembly reaction mixture were transformed into chemically competent E. coli. The bacteria were then regenerated in 500 μl LB medium for 1 hour at 37° C. Medium was added up to 2 ml and supplied with 50 μg/ml of spectinomycin for selection, and the cultures were grown at 37° C. overnight. On day 2, plasmids were prepared from the 2-ml cultures, each containing a pool of transformants, using the Plasmid Mini Kit from Qiagen (Hilden, Germany) according to the manufacturer's instructions. The purified first and second capture vector plasmid preparations were subsequently used in the step (ii) assembly reaction without further sequence verification. For this purpose, 50 ng of each purified capture vector pool was mixed with 200 ng of the functional vector and incubated with 4 Units (2 μl) of type IIS restriction enzyme AarI (Fermentas, Hanover, Md.) in the presence of 0.5 μM of oligonucleotides (as recommended by Fermentas) for 1 hour at 37° C. in a 20 μl reaction volume containing 2 μl of NEB4 buffer. Following addition of 1 μl buffer NEB4, 3 μl 10 mM ATP, 1 μl T4 ligase (NEB) and 5 μl H₂O, the reaction mixture was incubated for 1 hour at 22° C. to allow for ligation of assembled functional vector, followed by an optional additional digestion step for 1 hour at 37° C. The enzymes were inactivated for 10 minutes at 65° C. before 5 μl of reaction mixture were transformed into chemically competent E. coli, which were plated overnight on selective media. On day 3, 8 cfu were screened for correct insert size by cPCR as outlined for day 2 of Example 3a, and 15 ml of LB-medium with kanamycin (25 μg/ml) were inoculated with selected cfu and grown overnight at 37° C. On day 4, overnight cultures were harvested and plasmids were prepared and sequenced as outlined for day 3 in Example 3a, resulting in ~100 μg amounts of purified sequence-verified functional vector.

Example 3c: TAL Effector Fusion Assembly According to a Third Protocol

On day 1, step (i) assembly of four trimers into each capture vector was performed as described for day 1 of Examples 3a and 3b. Twenty μl of each step (i) reaction mixture containing assembled and ligated first and second capture vectors carrying TAL repeat subsets were mixed with 200 ng of the functional vector and incubated with 8 Units (4 μl) of type IIS restriction enzyme AarI (Fermentas) in the presence of 0.5 μM of oligonucleotides (as recommended by Fermentas) for 1 hour at 37° C. in an 80 μl reaction volume containing 4 μl of NEB4 buffer. Following addition of 2 μl buffer NEB4, 6 μl 10 mM ATP, 2 μl T4 ligase (NEB) and 10 μl H₂O, the reaction mixture was incubated for 1 hour at 22° C. to allow for ligation of assembled functional vector, followed by an additional digestion step for 1 hour at 37° C. The enzymes were inactivated for 10 minutes at 65° C. before 10 μl of reaction mixture were transformed into chemically competent E. coli, which were plated overnight on selective media. On day 2, 8 cfu were screened for correct insert size by cPCR as outlined for day 2 of Example 3a, and 15 ml of LB-medium with kanamycin (25 μg/ml) were inoculated with selected cfu and grown overnight at 37° C. On day 3, overnight cultures were harvested and plasmids were prepared and sequenced as outlined for day 3 in Example 3a, resulting in ~100 μg amounts of purified sequence-verified functional vector.

Example 4: A Genetic Inverter System to Demonstrate TAL Effector Function in E. coli

This example describes the development of a genetic inverter system created to test whether plant-derived AvrBs3 TAL proteins are active in E. coli. For this purpose, a reporter plasmid encoding a destabilized GFP protein with a short half-life in cells was designed, wherein GFP expression is under the transcriptional control of a synthetic pTrc-UPA (upregulated by AvrBs3) promoter harboring the natural UPA20 TAL binding site (see FIG. 13A). In addition, arabinose-inducible TAL expression cassettes were constructed for expression of different C-terminally truncated variants of AvrBs3 TAL protein (Avr28, Avr63, Avr93) with binding specificity for the UPA20 target site. To validate correct TAL protein expression in E. coli, the three AvrBs3 truncation constructs were expressed separately as fusions to thioredoxin, and protein extracts were analysed by SDS-PAGE (see FIG. 13B), showing the calculated molecular weight for each TAL protein. Following co-transformation of the GFP reporter plasmid together with one of the AvrBs3 expression constructs into E. coli cells, TAL expression was induced by addition of arabinose. Reporter strains expressing AvrBs3 constructs showed significantly decreased fluorescence relative to control strains not induced with arabinose (FIG. 13C). These results demonstrate that TAL effectors are capable of repressing reporter gene activity in a non-natural bacterial host (E. coli).

Example 5: A TAL Genetic Circuit Developed for Microalgae

To test TAL function in microalgae, a TAL genetic circuit for microalgae was constructed by replacing the activation domain of Hsp70A with a 3×AvrBs3 TAL binding site upstream of an RbcS2 minimal plant promoter that drives expression of a luciferase reporter gene at very low activity. Meanwhile, an AvrBs3 TAL effector was fused in frame to the N-terminus of a hygromycin resistance gene under control of a constitutive pB tubulin promoter (see FIG. 14). As a positive control, the genetic circuit was expressed under control of a strong chimeric Hsp70A-RbcS2 promoter demonstrated to work well in algae. The Hsp70A promoter serves as a transcriptional activator when placed upstream of the RbcS2 promoter, enhancing its efficiency. A circuit with a chimeric 3×AvrBs3-RbcS2 promoter but without TAL effector was used as a negative control. One microgram of each construct (either circular or linearized DNA) was transformed into Chlamydomonas (1×10⁸ cells) using electroporation followed by selection on TAP agar plates containing 10 μg/ml Hygromycin B. The selected colonies were assayed for luciferase expression (A) and TAL expression by Western blot analysis (B). For this purpose, the colonies were inoculated in 1 ml of TAP medium containing 10 μg/ml Hygromycin, incubated in an algal chamber for 4 days and then transferred to 9 ml TAP medium containing 10 μg/ml Hygromycin. The 10 ml cultures were grown for 3 days and then harvested by centrifugation. Cells were lysed in 300 μl of lysis buffer and cell lysate was obtained by centrifugation. The lysate was assayed for luciferase activity using coelenterazine as substrate. 30 μl of the supernatants were mixed with SDS sample buffer containing 50 mM DTT. Upon heating at 95° C. for 5 minutes, the samples were loaded onto a NuPAGE gel. After electrophoresis, the proteins were transferred to a PVDF membrane using iBlot. The membrane was blocked for 1 hour and then incubated for 1 hour with primary mouse anti-His tag antibody at a final concentration of 0.2 μg/mL, followed by incubation for 1 hour with secondary goat anti-mouse HRP conjugate antibody at a 1:2000 dilution. The membrane was developed with ECL Chemiluminescent Substrate.

Example 6: TAL-Mediated Activation and Repression of Genes in Human Cells

To analyze TAL-mediated activation or repression activities in human cells, two FLP-IN™ stable 293 cell lines in which a single copy TAL response cassette was integrated into the genome were established, one with a GFP reporter driven by an E1B mini promoter, another by a full-length CMV promoter (FIG. 16B). To demonstrate TAL-mediated activation, the TAL responsive cell line carrying the E1B-controlled GFP gene was co-transfected with a red fluorescence protein (RFP) expression plasmid as transfection control and one of the following vectors: pcDNA3.3-AvrBs3, expressing a wild-type TAL effector (TAL), pcDNA3.3-AvrBs3-VP16, expressing a TAL fused to a VP16 activation domain (TAL+VP16), pcDNA3.3-GAL-VP16, expressing a GAL4 binding domain fused to a VP16 domain (GAL+VP16) or an empty vector (Vector only). Cells were harvested 48 hours post-transfection and RFP positive cells were gated and analyzed by flow cytometry. As shown in FIG. 15A (left panel), wild-type TAL effectors and TAL-VP16 fusions efficiently activated reporter gene expression whereas vector only or an irrelevant activator (GAL-VP16) had no effect.

To further demonstrate TAL-mediated repression, the TAL responsive cell line carrying the CMV-controlled GFP gene was co-transfected with a red fluorescence protein (RFP) expression plasmid as transfection control and one of the following vectors: a TAL fused to a KRAB repressor domain (TAL repressor), a Tet repressor (TetR) or an empty vector (FIG. 15A, right panel). Transfected cells were harvested and subjected to FACS analysis. GFP protein expression was significantly reduced in the presence of the TAL-KRAB but was not impacted by empty vector or an irrelevant Tet repressor. In sum, these data demonstrate that TAL effectors can be specifically directed to target sites in the mammalian genome where they are capable of activating or repressing gene activity.

Example 7: Transient TAL-Mediated Repression in Mammalian Cells

To evaluate the activity of an engineered TAL repressor in human cells, a TAL repressor was constructed by replacing the C-terminal activation domain of AvrBs3 with a KRAB domain, the repression domain of a zinc finger protein. A reporter construct harboring a Tet-responsive binding site was used as negative control to demonstrate TAL specificity. The reporter constructs express GFP or LacZ from a full-length CMV promoter harboring a TAL DNA binding sequence or a Tet binding sequence as a control. 293FT cells were co-transfected with the AvrBs3-KRAB construct or a Tet construct or empty vector and one of the GFP expression constructs harboring either the TAL binding or Tet binding site. Microscopic images of cells were taken 48 h post-transfection for GFP reporter expression (FIG. 16B, left panel). AvrBs3-KRAB repressed its corresponding reporter gene expression but had no effect on Tet responsive reporter gene expression, suggesting that the repression by AvrBs3-KRAB is sequence specific.

293FT cells were transfected with the indicated combination of plasmids in 96-well plates. Cells were lysed using 100 μl luciferase lysis buffer 72 hours post-transfection and the β-galactosidase activity was determined using the FluoReporter LacZ/Galactosidase Quantitation kit (F-2905, Life Technologies). Briefly, 2-10 μl of cell lysate per well was added to 100 μl of reaction buffer (0.1 M NaPO4, pH 7.3, 1 mM MgCl₂, 45 mM β-mercaptoethanol, 1.1 mM CUG substrate). The reaction was incubated for 30 min, followed by adding 50 μl of stop solution (0.2 M Na₂CO₃) to each well. β-galactosidase activity was measured at an excitation of 390 nm and an emission of 460 nm on a SpectraMax (FIG. 16B, right panel). The figure is graphed as the percentage of the signal relative to the pcDNA3 control transfection. Co-expression of AvrBs3-KRAB specifically repressed the reporter gene expression by about 70% with two copies of the TAL DNA binding site and by around 40% with one copy of the DNA binding site, which is comparable with the effect of the Tet repressor.
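The normalization described for the β-galactosidase readout (signal expressed as a percentage of the pcDNA3 control transfection) can be sketched in Python as follows. This is a minimal illustration only: the function names, the optional blank subtraction and the fluorescence readings are hypothetical and were chosen merely to mirror the reported repression of about 70%; they are not measured data.

```python
def percent_of_control(sample_signal, control_signal, blank=0.0):
    """Reporter signal as a percentage of the control transfection."""
    corrected_control = control_signal - blank
    if corrected_control <= 0:
        raise ValueError("control signal must exceed the blank")
    return 100.0 * (sample_signal - blank) / corrected_control


def percent_repression(sample_signal, control_signal, blank=0.0):
    """Repression is the drop below 100% of the control signal."""
    return 100.0 - percent_of_control(sample_signal, control_signal, blank)


# Hypothetical fluorescence readings (arbitrary units), not measured values.
control_reading = 1000.0   # reporter + pcDNA3 control vector
repressor_reading = 300.0  # reporter + AvrBs3-KRAB

print(f"{percent_of_control(repressor_reading, control_reading):.0f}% of control")
print(f"{percent_repression(repressor_reading, control_reading):.0f}% repression")
```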

Example 8: Assay for Demonstration of Cleavage of Genomic Target DNA by TAL Nucleases

To quantitatively assess the ability of a custom TAL nuclease pair to cleave a specific genomic DNA target, a GFP-based cleavage assay was developed. For this purpose, spacers of different lengths (10, 15, 20 nucleotides) were inserted into a region of the GFP open reading frame that is known to result in a protein that is still partially functional (Guo et al., J. Mol. Biol., 400:96-107 (2010)). However, these spacers were designed to shift the open reading frame such that a non-functional protein is expressed. Three such constructs were generated and were each individually incorporated into a single defined location in 293FT cells using the Jump In™ targeted integration system (FIG. 17A) to make a panel of stable cell lines bearing a single defective GFP gene. These cell lines were created by inserting a mutated EmGFP gene with a TAL binding site cassette placed in the loop region as described in Guo, et al. The GFP gene is expressed under the control of the EF1 alpha promoter in the pJTI-R4-Dest vector. This vector was then targeted to the Jump-In locus in HEK293 JI cells by cotransfection with a vector expressing R4 integrase, pJTI-R4-Int, using Lipofectamine 2000. Targeted cells were then selected in the presence of neomycin. A TAL nuclease pair was then constructed (ArtXA-FokKK and ArtXB-FokEL) wherein the TAL repeats were designed to bind specific target sites ((T) CTTCT GCACC GGTAT GCG (SEQ ID NO: 113) and (T) ATTCT GGGAC GTTGT ACG (SEQ ID NO: 114), respectively) flanking the spacers in the mutated EmGFP open reading frame. The TAL nuclease constructs were LF2K-transfected into the 293FT cells containing the stably integrated GFP reporters. Binding of the TAL repeat domain was predicted to result in cleavage by the nuclease domain within the spacer region (see FIG. 17B, upper panel). Upon cleavage at the genomic site, the DNA break would be repaired by the endogenous non-homologous end joining pathway, which introduces or removes a small number of nucleotides at the repair site. Therefore, it was expected that the correct translation frame would be restored for GFP expression in about one third of the cases. The partial re-establishment of the GFP ORF resulted in dim green cells that could be distinguished from background fluorescence by flow cytometry analysis (see FIG. 17B, lower panel).
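The expectation that roughly one third of repair events restore the reading frame follows from modular arithmetic, as the short Python sketch below illustrates. It assumes, purely for illustration, that net insertion/deletion sizes from -9 to +9 nucleotides are equally likely; actual repair outcomes are not uniformly distributed, and the function names are hypothetical.

```python
def restores_frame(frame_offset, net_indel):
    """True if a repair indel brings the total offset back to a multiple of 3.

    frame_offset: net shift of the disrupted reading frame (1 or 2).
    net_indel: nucleotides inserted (+) or deleted (-) during repair.
    """
    return (frame_offset + net_indel) % 3 == 0


def restored_fraction(frame_offset, indel_sizes):
    """Fraction of equally weighted repair outcomes that restore the frame."""
    sizes = list(indel_sizes)
    hits = sum(restores_frame(frame_offset, d) for d in sizes)
    return hits / len(sizes)


# Assumption: equally weighted net indels of -9..+9 nt, excluding 0.
indels = [d for d in range(-9, 10) if d != 0]

for offset in (1, 2):
    print(offset, round(restored_fraction(offset, indels), 3))
```

For either frame offset, exactly 6 of the 18 assumed outcomes restore the frame, i.e. one third.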

Example 9: Sequence Mapping of TALE Nuclease Mediated Genomic Lesions

Example 9a

To evaluate successful TAL nuclease-mediated cleavage of a genomic target sequence, the following assay was developed: Genomic DNA was isolated from TAL nuclease-treated and untreated cells (FIGS. 20A and B) and was cleaved using a cocktail of common cutters to create ˜100 bp fragments (FIG. 20C). The resulting fragments of both samples are then mixed, melted and cross-hybridized, yielding a mixture of correct and mismatched double stranded fragments (FIG. 20D). The fragments are then ligated with specific sequencing primers containing a very rare restriction site at the 5′ end, referred to as a ‘P1*’ adapter (FIG. 20E). After clean up, the mismatch (indicating the lesion to be identified) is cleaved by treatment with a PM (Perkinsus) mismatch nuclease (FIG. 20E). However, as understood by those skilled in the art, any other suitable MME nuclease could be used for this purpose, such as, e.g., Cel1 or Res1 nucleases. The cleavage, which is limited to the mismatched fragments, results in a population containing new non-adapted ends (FIG. 20F). The entire population of fragments is then adapted with a second sequencing primer, referred to as the ‘A’ adapter, which does not contain the rare site in primer 1 (FIG. 20G). The entire population is then treated with the rare cutting enzyme to release primer 2 ligated to primer 1 (FIG. 20H). This leaves a population of fragments appended with primer 1 on each end (non-lesion) and fragments with primer 1 on one end and primer 2 ligated to the lesion site. This population is then subjected to sequencing using primer 2 to identify the genomic lesion sites.

Example 9b

In a second embodiment, genomic lesions were detected according to the following protocol. To extract genomic DNA from TAL nuclease-treated and untreated Vero E6 cells, two samples of 1×10⁶ cells each were pelleted at 270×g for 5 min. The supernatants were gently removed and cells were resuspended vigorously in 50 μl of PicoPuro solution (Life Technologies, Carlsbad). The reaction was then transferred to PCR-compatible tubes and extraction of genomic DNA was finished by incubating the sample in a PCR Cycler at 68° C. for 15 min and 95° C. for 8 min followed by final storage at 4° C. Both samples were then subjected to a PCR amplification step in the presence of a primer mix to amplify amplicons containing the predicted genomic lesion. For this purpose, 2 μl of each template genomic DNA sample were mixed in a 50 μl reaction volume with 25 μl of 2× GOLD 360 PCR mix (Life Technologies) and 1 μl of a 10 μM primer mix (yielding 400 bp amplicons) and were amplified under the PCR conditions provided in TABLE 23 in the presence of the PHUSION® High Fidelity DNA Polymerase (New England Biolabs, Beverly, Mass.):

TABLE 23 amplicon PCR protocol 1

Cycle   Number   Step   ° C.   Time
1       1        1      95     10:00
2       30       1      95     00:30
                 2      55*    00:30
                 3      72     01:00
3       1        1      72     07:00
4       1        1      4      ∞

*annealing temperature depends on choice of primers
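For planning purposes, the cycling program of TABLE 23 can be written down as a small data structure and its programmed hold time summed, as in the Python sketch below. The representation and names are illustrative assumptions; ramp times between temperatures and the final 4° C. hold are not included, so the actual run takes longer.

```python
def to_seconds(mmss):
    """Convert an 'mm:ss' string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# (number of repeats, [(temperature in deg C, hold time 'mm:ss'), ...])
# Taken from TABLE 23; 55 deg C is the primer-dependent annealing step.
program = [
    (1, [(95, "10:00")]),
    (30, [(95, "00:30"), (55, "00:30"), (72, "01:00")]),
    (1, [(72, "07:00")]),
]

total_s = sum(
    repeats * sum(to_seconds(hold) for _, hold in steps)
    for repeats, steps in program
)

print(f"programmed hold time: {total_s // 60} min")
```

The programmed holds add up to 77 minutes before ramping is taken into account.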

The PCR products were then purified on spin columns and the OD260 was measured for each sample. Five μl of the PCR products were then run on a 1.2% SDS electrophoresis gel to determine whether the PCR was successful and provided only a single (400 bp) amplicon band.

Two 10-μl cleavage reactions were prepared for each sample (one containing an enzyme mix and the other as negative control), each containing 1 μl of a 10× endonuclease reaction buffer (200 mM Tris pH 8.3, 5 mM NAD, 250 mM KCl, 100 mM MgCl₂, 0.1% Triton X-100) and 100 ng of PCR product, made up to a final volume of 9 μl with H₂O. The samples were then incubated at 98° C. for 2 minutes to quantitatively denature all double stranded DNA before samples were cooled to 4° C. for 5 minutes to allow random reassortment of the single strands, which causes random reannealing of the amplicons, thereby converting any mutations into mismatched duplex DNA. Reannealing was allowed to finalize at 37° C. for 5 minutes and samples were cooled down to 4° C. and stored on ice.

For cleavage of mismatch positions in the reannealed amplicons, test samples (but not control samples) were treated with an enzyme mix containing T7 endonuclease I and Taq ligase in an enzyme dilution buffer. To obtain 100 μl of enzyme composition, 10 μl of T7 endonuclease I (10 U/μl) and 10 μl of Taq ligase (40 U/μl) (both New England Biolabs, Beverly, Mass.) were mixed with 80 μl enzyme dilution buffer (10 mM Tris pH 7.4 at 4° C., 50 mM KCl, 0.1 mM EDTA, 50% glycerol, 200 μg BSA/ml, 0.15% Triton X-100) at 4° C. for 2 hours and subsequently stored at −20° C.

One μl of this enzyme composition was then added to the test samples and the reactions were incubated at 37° C. for 1 hour in a PCR cycler and then immediately moved to 4° C. before they were loaded on a 2% EX gel (Life Technologies). The gel was run for approximately 10 minutes before bands were measured by densitometry and analysed using gel analysis software (IMAGEQUANT™ 5.1, GE Healthcare). To determine mismatch-mediated endonuclease cleavage in the test samples, intensities of cleaved bands were determined and divided by the total intensity of all measured bands. Control reactions without enzyme mix served to determine background intensity.
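The densitometric calculation described above (intensity of cleaved bands divided by total intensity of all measured bands) can be sketched in Python as follows. Two points are assumptions rather than part of the protocol: the background from the no-enzyme control is subtracted per band, and the second function applies the commonly used estimate 1 − √(1 − f) to convert a cleaved fraction f into a fraction of modified alleles, which presumes random reannealing. The band intensities are hypothetical values for illustration only.

```python
import math


def fraction_cleaved(cleaved_bands, all_bands, background=0.0):
    """Cleaved-band intensity divided by total intensity of all bands.

    Assumption: the background (from the no-enzyme control) is subtracted
    from each band, and negative values are clipped to zero.
    """
    cleaved = sum(max(i - background, 0.0) for i in cleaved_bands)
    total = sum(max(i - background, 0.0) for i in all_bands)
    if total <= 0:
        raise ValueError("no signal above background")
    return cleaved / total


def estimated_modified_fraction(f_cleaved):
    """Commonly used estimate, not stated in the protocol: 1 - sqrt(1 - f)."""
    return 1.0 - math.sqrt(1.0 - f_cleaved)


# Hypothetical band intensities (arbitrary densitometry units).
uncleaved_band = 640.0
cleaved_bands = [200.0, 160.0]

f = fraction_cleaved(cleaved_bands, [uncleaved_band] + cleaved_bands)
print(f"fraction cleaved: {f:.2f}")
print(f"estimated modified fraction: {estimated_modified_fraction(f):.2f}")
```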

Clause 1. A library of TAL nucleic acid binding cassettes for assembly of a TAL effector sequence, wherein the library of cassettes contains at least four different categories of cassettes encoding TAL repeats with all cassettes of one category binding to at least one of the bases adenine, guanine, thymidine, and cytosine in a nucleic acid target sequence, wherein each cassette is allocated to one or more distinct positions in the TAL effector sequence and wherein the library of cassettes contains at least one first cassette per category wherein the nucleotide composition of said first cassette differs from the nucleotide compositions of all other cassettes of the same category and wherein said first cassette is allocated to only one distinct position in the series of cassettes in the TAL effector sequence.

Clause 2. A library according to clause 1, wherein the library of cassettes contains at least one second cassette per category wherein the nucleotide composition of said second cassette differs from the nucleotide composition of the first cassette and from the nucleotide composition of all other cassettes of the same category and wherein said second cassette is allocated to only one distinct position in the series of cassettes in the TAL effector sequence which is different from the position of the first cassette.

Clause 3. A library according to clause 1 or 2 wherein the one or more distinct positions within the TAL effector sequence are determined by complementary terminal overhangs between cassettes.

Clause 4. A library according to any one of the preceding clauses wherein the TAL effector sequence comprises between 6 and 25 cassette positions.

Clause 5. A library according to any one of the preceding clauses wherein the TAL effector sequence comprises at least 9 cassette positions.

Clause 6. A library according to any one of the preceding clauses wherein the TAL effector sequence comprises 17.5 or 18 or 23.5 or 24 cassette positions.

Clause 7. A library according to any one of clauses 1 to 3 wherein the TAL effector sequence comprises more than 25 cassette positions.

Clause 8. A library according to any one of the preceding clauses wherein the nucleotide composition of the at least one first cassette and/or the at least one second cassette differs within a region that is homologous (i.e. contains an identical nucleotide composition) in all other cassettes of the library. Thus, said homologous region is located outside the terminal ends of the cassettes providing the compatible overhangs.

Clause 9. A library according to clause 8 wherein said homologous region has a length of at least 10, at least 15, or between 18 and 30 nucleotides.

Clause 10. A library according to clause 8 or 9 wherein the nucleotide composition of the at least one first cassette and/or the at least one second cassette differs within said homologous region by at least 3, preferably at least 4 nucleotides.

Clause 11. A library according to clause 10 wherein the at least 3, preferably at least 4 nucleotides are positioned near the 5′-end or 3′-end of said homologous region.

Clause 12. A library according to any one of the preceding clauses wherein said one distinct position of said first cassette is a position in the center or close to the center of the total amount of cassette positions (e.g. in a TAL effector sequence with 24 cassette positions a position in the center or close to the center may include one of positions 7 to 18; or in a TAL effector sequence with 18 positions a position in the center or close to the center may include one of positions 4 to 15).

Clause 13. A TAL effector sequence containing a series of TAL nucleic acid binding cassettes selected from one or more of at least four different categories of cassettes encoding TAL repeats with all cassettes of one category binding to at least one of the bases adenine, guanine, thymidine, and cytosine in a nucleic acid target sequence, wherein the nucleotide composition of at least one first cassette in the series of cassettes differs from the nucleotide composition of all other cassettes of the same category.

Clause 14. A TAL effector sequence according to clause 13 wherein the nucleotide composition of the at least one first cassette differs within a region of the cassette that is homologous (i.e., contains an identical nucleotide composition) in all other cassettes of said TAL effector sequence.

Clause 15. A TAL effector sequence according to clause 14 wherein said homologous region has a length of at least 10, at least 15, or between 18 and 30 nucleotides.

Clause 16. A TAL effector sequence according to clause 14 or 15, wherein the nucleotide composition of the at least one first cassette and/or the at least one second cassette differs within said homologous region by at least 3, preferably at least 4 nucleotides.

Clause 17. A TAL effector sequence according to any one of clauses 13 to 16 wherein the at least one first cassette is located in the center or close to the center of the series of cassettes (e.g. in a TAL effector sequence with 24 cassettes the at least one first cassette may be located at one of positions 7 to 18; or in a TAL effector sequence with 18 positions the at least one first cassette may be located at one of positions 4 to 15).

Clause 18. A TAL effector sequence according to any one of clauses 13 to 16 wherein the TAL effector sequence comprises between 6 and 25 cassettes.

Clause 19. A TAL effector sequence according to any one of clauses 13 to 16 wherein the TAL effector sequence comprises at least 9 cassettes.

Clause 20. A TAL effector sequence according to any one of clauses 13 to 16 wherein the TAL effector sequence comprises more than 25 cassettes.

Clause 21. A TAL effector sequence according to any one of clauses 13 to 16 wherein the TAL effector sequence comprises 17.5 or 18 cassettes and wherein the at least one first cassette is located at one of positions 4 to 15.

Clause 22. A TAL effector sequence according to any one of clauses 13 to 16 wherein the TAL effector sequence comprises 23.5 or 24 cassettes and wherein the at least one first cassette is located at any one of positions 7 to 18.

Clause 23. A TAL effector fusion containing a TAL effector sequence according to any one of clauses 13 to 22.

Clause 24. A vector containing a TAL effector sequence according to clauses 13 to 22 or a TAL effector fusion according to clause 23.

Clause 24. A cell containing a TAL effector sequence according to clauses 13 to 22 or a TAL effector fusion according to clause 23 or a vector according to clause 22.

Clause 25. A method of sequencing a TAL effector sequence according to any one of clauses 13 to 22 wherein said method comprises using at least one sequencing primer specifically binding to the at least one first cassette within the TAL effector sequence.

Clause 26. A method according to clause 25 wherein said at least one sequencing primer contains a 3′-end specifically binding to the at least one first cassette.

Clause 27. A method according to clause 26 wherein the 3′-end of the sequencing primer contains at least 3, preferably 4 nucleotide positions determining the binding specificity for the at least one first cassette.

Clause 28. A method according to clause 27 wherein the 5′ end of the at least one sequencing primer binds within a region that is homologous (i.e. contains an identical nucleotide composition) in all cassettes of the TAL effector sequence.

Another aspect of the invention is further described by the following set of clauses.

Clause 1. A method of detecting and identifying one or more genomic locus modifications comprising the steps of

-   a) isolating genomic DNA from (i) a cell treated with a TAL effector nuclease or a pair of TAL effector nucleases and (ii) an untreated cell,
-   b) cleaving the isolated genomic DNA obtained from both samples with a mixture of restriction enzymes,
-   c) mixing the samples containing cleaved DNA fragments,
-   d) subjecting the double stranded DNA fragments to a melting and re-hybridizing procedure,
-   e) ligating the ends of the re-hybridized DNA fragments with a first double-stranded DNA adapter containing at least one restriction enzyme cleavage site at its 5′-end,
-   f) optionally purifying the adapter containing DNA fragments,
-   g) treating the DNA fragments containing the first adapter with a mismatch cleaving enzyme thereby obtaining a pool of cleaved and uncleaved DNA fragments,
-   h) ligating the ends of the cleaved and uncleaved DNA fragments with a second double-stranded DNA adapter lacking the restriction enzyme cleavage site of the first adapter,
-   i) treating the population of DNA fragments containing the second adapter with a restriction enzyme specifically cleaving the at least one restriction enzyme cleavage site at the 5′-end of the first adapter resulting in the release of the second adapter,
-   j) optionally separating the population of DNA fragments containing only the first adapter from the population of DNA fragments containing a first and a second adapter,
-   k) subjecting at least the population of DNA fragments containing a first and a second adapter to a sequencing reaction using the second adapter as primer binding site, and
-   l) identifying the one or more genomic locus modifications.

Clause 2. A method according to clause 1 wherein the mixture of restriction enzymes in step b) contains one or more restriction enzymes having a four or six base pair recognition sequence.

Clause 3. A method according to clause 1 or 2, wherein the mismatch cleaving enzyme of step g) is selected from the group of Perkinsus marinus nuclease PA3, Cel1 or Res1.

Clause 4. A method according to any one of the preceding clauses wherein the restriction enzyme cleaving the restriction enzyme cleavage site in step i) has a seven or eight base pair recognition sequence.

Clause 5. A method according to any one of the preceding clauses wherein the sequencing reaction in step k) further comprises binding the population of DNA fragments containing a first and a second adapter to beads using the first adapter as an anchor.

Clause 6. A method according to any one of the preceding clauses wherein step l) comprises mapping the sequences obtained in step k) against the genome of the cell.

Clause 7. A method according to any one of the preceding clauses wherein step k) comprises personal genome machine (PGM) sequencing.

Another aspect of the invention is further described by the following set of clauses.

Clause 1. A linear nucleic acid molecule comprising:

-   (a) a region encoding an N terminal portion of a TAL effector,
-   (b) a region encoding a C terminal portion of a TAL effector,
-   (c) at least one recombination site, and
-   (d) at least one covalently bound topoisomerase,

wherein the topoisomerase is located at one of the termini of the linear nucleic acid molecule and is within 100 nucleotides of the at least one recombination site, and

wherein, when the nucleic acid molecule is circularized and contains a TAL repeat located between the termini of the nucleic acid molecule, the circularized nucleic acid molecule encodes a TAL effector which is capable of binding to a specified nucleic acid sequence.

Clause 2. The linear nucleic acid molecule according to clause 1, wherein the linear nucleic acid molecule contains an origin of replication.

Clause 3. The linear nucleic acid molecule according to any one of the preceding clauses, wherein the at least one recombination site is selected from the group consisting of:

-   (a) an att site,
-   (b) a lox site, and
-   (c) a frt site.

Clause 4. The linear nucleic acid molecule according to any one of the preceding clauses, wherein the at least one covalently bound topoisomerase is a Type IA, Type IB, Type IIA, or Type II topoisomerase.

Clause 5. The linear nucleic acid molecule according to any one of the preceding clauses, wherein the at least one covalently bound topoisomerase is a Vaccinia virus topoisomerase.

Another aspect of the invention is further described by the following set of clauses.

Clause 1. A method for preparing a TAL effector library, the method comprising:

-   (a) connecting a population of TAL nucleic acid binding cassettes that individually encode adenine, guanine, thymidine, or cytosine base binders, when the base is present in a nucleic acid molecule, and
-   (b) introducing the connected TAL nucleic acid binding cassettes generated in (a) into a vector to generate a TAL effector library, wherein the library encodes TAL effectors which bind to different nucleotide sequences.

Clause 2. The method according to clause 1, wherein TAL nucleic acid binding cassettes that encode adenine, guanine, thymidine, and cytosine binders are not all present in equimolar amounts.

Clause 3. The method according to any one of the preceding clauses, wherein TAL nucleic acid binding cassettes that encode adenine and thymine binders are present in equimolar amounts and represent from about 51% to about 75% of the total TAL nucleic acid binding cassettes present.

Clause 4. The method according to any one of the preceding clauses, wherein the TAL effector library encodes TAL effector fusions.

Clause 5. The method according to any one of the preceding clauses, wherein the TAL effector fusions have transcriptional activation activity.

Clause 6. The method according to any one of the preceding clauses, wherein the TAL effector fusion inhibits transcription.

Clause 7. The method according to any one of the preceding clauses, wherein the vector is a viral vector.

Clause 8. The method according to any one of the preceding clauses, wherein the vector contains at least one recombination site.

Clause 9. The method according to any one of the preceding clauses, wherein the at least one recombination site is an att site.

Clause 10. A TAL effector library prepared by the method according to any one of the preceding clauses.

Another aspect of the invention is further described by the following set of clauses.

Clause 1. A method for identifying TAL effectors that bind to specified nucleotide sequences, the method comprising:

-   (a) connecting a population of TAL nucleic acid binding cassettes which individually encode TAL subunits that bind to one of the bases adenine, guanine, thymidine, and cytosine, when the base is present in a nucleic acid molecule,
-   (b) introducing the connected TAL nucleic acid binding cassettes generated in (a) into a vector to generate a TAL effector library, wherein the library contains TAL effectors which bind to different nucleotide sequences,
-   (c) introducing the TAL effector library into a cell under conditions which allow for the expression of TAL effectors, and
-   (d) screening the cells generated in (c) to identify cells in which at least one cellular parameter is altered by expression of a TAL effector.

Clause 2. The method according to clause 1, wherein the cellular parameter is TAL effector induced transcriptional activation of a non-TAL effector gene.

Clause 3. The method according to any one of the preceding clauses, wherein the cell contains nucleic acid comprising a promoter operably linked to a reporter and wherein the cellular parameter is transcriptional activation of the reporter.

Clause 4. The method according to any one of the preceding clauses, wherein the reporter is green fluorescent protein.

Clause 5. The method according to any one of the preceding clauses, wherein a TAL effector library member is isolated from a cell in which at least one cellular parameter is altered by expression of a TAL effector.

Clause 6. A composition comprising a nucleic acid molecule encoding the TAL effector isolated by the method according to any one of the preceding clauses.

Another aspect of the invention is further described by the following set of clauses.

Clause 1. A non-naturally occurring protein comprising:

-   -   (a) an amine terminal region of between 25 and 500 amino acids,    -   (b) a carboxyl terminal region of between 25 and 500 amino        acids, and    -   (c) a central region containing five or more amino acid segments        which confer upon the non-naturally occurring protein sequence        specific nucleic acid binding activity,

wherein each of the individual amino acid segments in (c) are between 30 and 38 amino acids in length, and

wherein at least one of the amino acid segments is at least 80% identical to one or more of the following amino acid sequences:

(1) FSQADIVKIAGN, (SEQ ID NO: 37)
(2) GGAQALQAVLDLEP, (SEQ ID NO: 38)
(3) GGAQALQAVLDLEPALRERG, (SEQ ID NO: 39)
(4) FRTEDIVQMVS, (SEQ ID NO: 40)
(5) GGSKNLAAVQA, (SEQ ID NO: 41)
(6) GGSKNLEAVQA, (SEQ ID NO: 42)
(7) LEPKDIVSIAS, (SEQ ID NO: 43)
(8) GATQAITTLLNKW, (SEQ ID NO: 44)
(9) GATQAITTLLNKWDXLRAKG, (SEQ ID NO: 45) and
(10) GATQAITTLLNKWGXLRAKG; (SEQ ID NO: 46)

wherein X is one of the following amino acids: aspartic acid, serine, alanine, or glutamic acid.

Clause 2. The non-naturally occurring protein according to clause 1, wherein none of the amino acid segments is identical to an amino acid sequence of a TAL protein which naturally occurs in a bacterium of the genera Xanthomonas or Ralstonia.

Clause 3. The non-naturally occurring protein according to any one of the preceding clauses, wherein at least one of the amino acid segments is not identical to an amino acid sequence shown in FIG. 30.

Clause 4. The non-naturally occurring protein according to any one of the preceding clauses, wherein at least one of the amino acid segments is not identical to one of the first eighteen amino acid sequences shown in FIG. 30.

Clause 5. The non-naturally occurring protein according to any one of the preceding clauses, wherein the protein is a fusion protein.

Clause 6. The non-naturally occurring fusion protein according to any one of the preceding clauses, wherein the fusion protein comprises a sequence specific nucleic acid binding activity and at least a second activity other than sequence specific nucleic acid binding activity.

Clause 7. A nucleic acid molecule comprising a sequence encoding the non-naturally occurring protein according to any one of the preceding clauses.

Clause 8. A vector comprising the nucleic acid molecule according to any one of the preceding clauses.

Clause 9. A host cell comprising the nucleic acid molecule according to clause 7 or the vector according to clause 8.
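By way of non-limiting illustration only, the "at least 80% identical" condition of clause 1 of this set may be sketched in Python. The clauses do not define how identity is computed; the sketch assumes an ungapped comparison of the reference sequence against every window of the segment, with identity expressed as a percentage of the reference length, and treats X as matching aspartic acid, serine, alanine, or glutamic acid. The function names and the test strings are hypothetical.

```python
# Two of the ten listed reference sequences, keyed by SEQ ID NO.
MOTIFS = {
    37: "FSQADIVKIAGN",
    45: "GATQAITTLLNKWDXLRAKG",
}
X_ALLOWED = set("DSAE")  # aspartic acid, serine, alanine, glutamic acid


def residue_matches(motif_res, seg_res):
    """Compare one reference residue with one segment residue; X is a wildcard."""
    if motif_res == "X":
        return seg_res in X_ALLOWED
    return motif_res == seg_res


def best_identity(segment, motif):
    """Highest ungapped percent identity of motif over any window of segment.

    Assumption: identity = matching positions / len(motif) * 100.
    Returns 0.0 when the segment is shorter than the motif.
    """
    best = 0.0
    for start in range(len(segment) - len(motif) + 1):
        window = segment[start:start + len(motif)]
        hits = sum(residue_matches(m, s) for m, s in zip(motif, window))
        best = max(best, 100.0 * hits / len(motif))
    return best


def meets_threshold(segment, threshold=80.0):
    """True if the segment reaches the threshold for any reference sequence."""
    return any(best_identity(segment, m) >= threshold for m in MOTIFS.values())


# Hypothetical test strings: one substitution (K to R) in SEQ ID NO: 37.
print(best_identity("AAFSQADIVRIAGNAA", MOTIFS[37]) >= 80.0)  # True (11 of 12)
```

A gapped alignment or a different denominator would give different percentages, so the threshold result depends on the identity definition that is actually applied.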

Another aspect of the invention is further described by the following set of clauses.

Clause 1. A method for generating a population of product cells, the method comprising:

(a) expressing a TAL-nuclease fusion in a population of starting cells to generate a sub-population of product cells that have undergone genetic recombination at a locus containing a detectable marker or selectable marker, wherein the TAL nuclease fusion is designed to bind to and cleave at least two nucleic acid loci in the population of starting cells and wherein at least one of the nucleic acid loci encodes the detectable marker or selectable marker, and

(b) generating the population of product cells by separating the product cells from the population of starting cells or selecting for the product cells.

Clause 2. The method according to clause 1, wherein one of the at least two nucleic acid loci is present on a vector.

Clause 3. The method according to any one of the preceding clauses, wherein one of the at least two nucleic acid loci encodes a detectable marker.

Clause 4. The method according to any one of the preceding clauses, wherein the nucleic acid locus encoding the detectable marker further encodes a selectable marker or a second detectable marker.

Clause 5. The method according to any one of the preceding clauses, wherein the two detectable markers are different fluorescent proteins.

Clause 6. The method according to any one of the preceding clauses, wherein one of the at least two nucleic acid loci encodes a selectable marker.

Clause 7. The method according to any one of the preceding clauses, wherein the nucleic acid locus encoding the selectable marker further encodes a second selectable marker or a detectable marker.

Clause 8. The method according to any one of the preceding clauses, wherein the selectable marker is a negative selectable marker selected from the group consisting of ccdB, Tse2, and Herpes simplex virus thymidine kinase.

Clause 9. The method according to any one of the preceding clauses, wherein the population of product cells is generated by collection of cells by fluorescence activated cell sorting.

Clause 10. The method according to any one of the preceding clauses, wherein the TAL nuclease fusion is designed to bind to and cleave a locus between a promoter and the detectable marker or selectable marker.

Another aspect of the invention is further described by the following set of clauses.

Clause 1. A method for the intracellular remodeling of chromatin, the method comprising expressing a TAL-chromatin modifier fusion in a cell, wherein the TAL nuclease fusion is designed to bind to a nucleic acid locus in the cell and modify the chromatin at the binding locus.

Clause 2. The method according to clause 1, wherein the chromatin modifier is a protein having at least one of ATPase, methylase, demethylase, acetylase, or deacetylase activities.

Clause 3. The method according to any one of the preceding clauses, wherein the TAL is fused to all or a portion of one of the following proteins: SWI2/SNF2, Mi-2, ISWI, BRM, BRG/BAF, Chd-1, Chd-2, Chd-3, Chd-4 and Mot-1.

Clause 4. The method according to any one of the preceding clauses, wherein the cell is an animal cell.

Clause 5. The method according to clause 4, wherein the animal cell is a mammalian cell.

While the invention has been described with reference to the specific embodiment thereof, it will be appreciated by those of ordinary skill in the art that modifications can be made to the structure and elements of the invention without departing from the spirit and scope of the invention as a whole.

U.S. Provisional Patent Application Nos. 61/620,228, filed Apr. 4, 2012, 61/644,975, filed May 9, 2012, and 61/784,658, filed Mar. 14, 2013, are incorporated herein by reference in their entireties.

1.-12. (canceled)
13. A TAL effector sequence containing a series of TAL nucleic acid binding cassettes selected from one or more of at least four different categories of cassettes encoding TAL repeats with all cassettes of one category binding to at least one of the bases adenine, guanine, thymidine, and cytosine in a nucleic acid target sequence, wherein the nucleotide composition of at least one first cassette in the series of cassettes differs from the nucleotide composition of all other cassettes of the same category.
14. A TAL effector sequence according to claim 13, wherein the nucleotide composition of the at least one first cassette differs within a region of the cassette that is homologous in all other cassettes of said TAL effector sequence.
15. A TAL effector sequence according to claim 14, wherein said homologous region has a length of at least 10, at least 15, or between 18 and 30 nucleotides.
16. A TAL effector sequence according to claim 14, wherein the nucleotide composition of the at least one first cassette and/or the at least one second cassette differs within said homologous region by at least 3, preferably at least 4 nucleotides.
17.-21. (canceled)
22. A TAL effector fusion containing a TAL effector sequence according to claim 13, wherein the TAL effector fusion is a TAL-nuclease fusion.
24. A vector containing a TAL effector sequence according to claim 13.
25. (canceled)
26. A method of sequencing a TAL effector sequence according to claim 13, wherein said method comprises using at least one sequencing primer specifically binding to the at least one first cassette within the TAL effector sequence.
27.-29. (canceled)
30. A method of detecting and identifying one or more genomic locus modifications comprising the steps of:
a) isolating genomic DNA from (i) a cell treated with a TAL effector nuclease or a pair of TAL effector nucleases and (ii) an untreated cell,
b) cleaving the isolated genomic DNA obtained from both samples with a mixture of restriction enzymes,
c) mixing the samples containing cleaved DNA fragments,
d) subjecting the double stranded DNA fragments to a melting and re-hybridizing procedure,
e) ligating the ends of the re-hybridized DNA fragments with a first double-stranded DNA adapter containing at least one restriction enzyme cleavage site at its 5′-end,
f) optionally purifying the adapter containing DNA fragments,
g) treating the DNA fragments containing the first adapter with a mismatch cleaving enzyme thereby obtaining a pool of cleaved and uncleaved DNA fragments,
h) ligating the ends of the cleaved and uncleaved DNA fragments with a second double-stranded DNA adapter lacking the restriction enzyme cleavage site of the first adapter,
i) treating the population of DNA fragments containing the second adapter with a restriction enzyme specifically cleaving the at least one restriction enzyme cleavage site at the 5′-end of the first adapter resulting in the release of the second adapter,
j) optionally separating the population of DNA fragments containing only the first adapter from the population of DNA fragments containing a first and a second adapter,
k) subjecting at least the population of DNA fragments containing a first and a second adapter to a sequencing reaction using the second adapter as primer binding site, and
l) identifying the one or more genomic locus modifications.
31. A method according to claim 30 wherein the mixture of restriction enzymes in step b) contains one or more restriction enzymes having a four or six base pair recognition sequence.
32.-36. (canceled)
37. A method for generating a population of product cells, the method comprising: (a) expressing the TAL-nuclease fusion of claim 23 in a population of starting cells to generate a sub-population of product cells that have undergone genetic recombination at a locus containing a detectable marker or selectable marker, wherein the TAL nuclease fusion is designed to bind to and cleave at least two nucleic acid loci in the population of starting cells and wherein at least one of the nucleic acid loci encodes the detectable marker or selectable marker, and (b) generating the population of product cells by separating the product cells from the population of starting cells or selecting for the product cells.
38.-46. (canceled)
47. A method for the intracellular remodeling of chromatin, the method comprising expressing a TAL-chromatin modifier fusion in a cell, wherein the TAL nuclease fusion is designed to bind to a nucleic acid locus in the cell and modify the chromatin at the binding locus.
48. The method of claim 47, wherein the chromatin modifier is a protein having at least one of ATPase, methylase, demethylase, acetylase, or deacetylase activities.
49.-51. (canceled)
52. A non-naturally occurring protein comprising: (a) an amine terminal region of between 25 and 500 amino acids, (b) a carboxyl terminal region of between 25 and 500 amino acids, and (c) a central region containing five or more amino acid segments which confer upon the non-naturally occurring protein sequence specific nucleic acid binding activity, wherein each of the individual amino acid segments in (c) are between 30 and 38 amino acids in length, and wherein at least one of the amino acid segments is at least 80% identical to one or more of the following amino acid sequences:
(1) FSQADIVKIAGN, (SEQ ID NO: 37)
(2) GGAQALQAVLDLEP, (SEQ ID NO: 38)
(3) GGAQALQAVLDLEPALRERG, (SEQ ID NO: 39)
(4) FRTEDIVQMVS, (SEQ ID NO: 40)
(5) GGSKNLAAVQA, (SEQ ID NO: 41)
(6) GGSKNLEAVQA, (SEQ ID NO: 42)
(7) LEPKDIVSIAS, (SEQ ID NO: 43)
(8) GATQAITTLLNKW, (SEQ ID NO: 44)
(9) GATQAITTLLNKWDXLRAKG, (SEQ ID NO: 45) and
(10) GATQAITTLLNKWGXLRAKG; (SEQ ID NO: 46)

wherein X is one of the following amino acids: aspartic acid, serine, alanine, or glutamic acid.
53. The non-naturally occurring protein of claim 52, wherein none of the amino acid segments is identical to an amino acid sequence of a TAL protein which naturally occurs in a bacterium of the genera Xanthomonas or Ralstonia.
54.-60. (canceled)
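By way of non-limiting illustration only, the condition recited in claims 13 to 16, namely that at least one cassette differs from the other cassettes of its category within a homologous region by a minimum number of nucleotides, may be sketched in Python. The sketch assumes that the homologous regions have already been extracted and are of equal length, and counts position-wise differences; the 18-nucleotide sequences and the function names are hypothetical.

```python
def mismatches(a, b):
    """Count positions at which two equal-length nucleotide strings differ."""
    if len(a) != len(b):
        raise ValueError("homologous regions must have equal length")
    return sum(x != y for x, y in zip(a, b))


def distinct_cassettes(regions, min_diff=3):
    """Indices of cassettes whose region differs from every other cassette's
    region by at least min_diff nucleotides (3 by default, per claim 16)."""
    return [
        i for i, region in enumerate(regions)
        if all(mismatches(region, other) >= min_diff
               for j, other in enumerate(regions) if j != i)
    ]


# Hypothetical 18-nucleotide homologous regions of four cassettes of one category.
common = "CTGACCCCAGAGCAGGTG"
variant = "CTGACGCCTGAACAGGTG"  # differs from `common` at three positions
regions = [common, common, variant, common]

print(mismatches(common, variant))  # 3
print(distinct_cassettes(regions))  # [2]
```

A cassette identified in this way carries a unique stretch within an otherwise repetitive series, which is the property that a sequencing primer as recited in claim 26 would rely on.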